#Lab.04 / IBM3202 – Comparative Modeling using MODELLER

##Theoretical Aspects

As we discussed during our Lectures, **the structure space is rather small when compared with the sequence space**. In fact, hierarchical structure classifications such as **CATH** have shown that only a small fraction of the enormous pool of structures deposited each year in the Protein Data Bank provides novel folds. In this regard, it seems that we have discovered almost all protein folds. Moreover, the elegant experiment by Chothia & Lesk [Chothia C & Lesk AM (1986) _EMBO J_ 5(4), 823–826] revealed that, generally, **protein structures are quite similar when their sequence identity is > 30-40%** (with some remarkable exceptions to this rule).

<figure>
<center>
<img src='https://raw.githubusercontent.com/pb3lab/ibm3202/master/images/cm_01.png'/>
<figcaption>FIGURE 1. Relationship between sequence identity and structural similarity for globular and membrane proteins.
<br>Olivella M et al (2013) <i>Bioinformatics 29(13), 1589-1592</i></figcaption></center>
</figure>

This evidence provides the perfect framework to model the structure of a given protein based on another protein (a template) whose structure has been solved and that shares sufficient sequence identity with it. But how does it work? Well, basically it operates by using a lot of distance constraints.

Similarly to what we saw for coevolutionary signals in our Lectures, in which pairs of residues are conserved regardless of their sequence separation because of their role in stabilizing the native state through interactions in 3D space, we can add interaction constraints to the sequence of a protein of unknown structure based on what we observe in the template structure.

<figure>
<center>
<img src='https://raw.githubusercontent.com/pb3lab/ibm3202/master/images/cm_02.png'/>
<figcaption>FIGURE 2. From sequence similarity to spatial constraints for comparative modelling in MODELLER.
<br>Fiser A & Sali A (2003) <i>Methods Enzymol 374, 461-491</i></figcaption></center>
</figure>
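
The transfer of spatial constraints from a template to a target can be sketched in a few lines of Python. This is only an illustration under simplified assumptions: the toy coordinates, the two alignment strings and the helper name `transfer_distance_restraints` are hypothetical, and MODELLER itself derives much richer restraints than the single Cα–Cα distances used here.

```python
import math

def transfer_distance_restraints(target_aln, template_aln, template_ca, max_dist=8.0):
    """Map template CA-CA distances onto aligned target residue pairs.

    target_aln, template_aln: aligned sequences of equal length ('-' = gap).
    template_ca: list of (x, y, z) tuples, one per template residue.
    Returns a list of (target_i, target_j, distance) with 0-based indices.
    """
    if len(target_aln) != len(template_aln):
        raise ValueError("aligned sequences must have the same length")
    # Collect (target index, template index) for every gap-free column
    pairs = []
    t_idx = s_idx = 0
    for t_res, s_res in zip(target_aln, template_aln):
        if t_res != '-' and s_res != '-':
            pairs.append((t_idx, s_idx))
        if t_res != '-':
            t_idx += 1
        if s_res != '-':
            s_idx += 1
    # Keep only residue pairs that are close in the template structure
    restraints = []
    for a in range(len(pairs)):
        for b in range(a + 1, len(pairs)):
            (ti, si), (tj, sj) = pairs[a], pairs[b]
            d = math.dist(template_ca[si], template_ca[sj])
            if d <= max_dist:
                restraints.append((ti, tj, round(d, 2)))
    return restraints

# Toy data (hypothetical): 4 template residues spaced 3.8 A apart on a line
template_ca = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
target_aln = "MK-LV"
template_aln = "-KALV"
restraints = transfer_distance_restraints(target_aln, template_aln, template_ca)
for i, j, d in restraints:
    print(f"target residues {i}-{j}: {d} A")
```

Only residues that are aligned to a template residue receive restraints, which is why unaligned regions (such as long insertions or terminal extensions) are modelled with much less certainty.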

##Overview

In this tutorial, we will pursue the template-based modeling of an enzyme homologous to _I. sakaiensis_ PET hydrolase (PETase) [Fecker T et al (2018) *Biophys J 114 (6), 1302-1312*] with the software **MODELLER** using a single-template approach (i.e. based on only one protein structure). Then, we will analyze the model and determine its quality using an online server.

#Part 0 – Downloading and Installing the required software

Before we start, you must first **remember to start the hosted runtime in Google Colab**.

Then, we must install several pieces of software to perform this tutorial. Namely:
- **MODELLER**, a widely used program for template-based modelling.
- **py3Dmol**, for visualization of the template and modelled protein structures.
- **biopython**, for manipulation and retrieval of protein structures and sequences.

After several tests, the following installation instructions are the best way of setting up **Google Colab** for this laboratory session.

1. We will first install MODELLER as follows:

⚠️**WARNING!:** In order to use MODELLER, you will need to obtain an Academic License by registering **[on this website](https://salilab.org/modeller/registration.html)**. The license key will be immediately sent to your email address.

In [1]:
#Before running this script, make sure to replace the MODELLER
#License Key with the one sent after registration in the MODELLER website
!wget https://salilab.org/modeller/10.4/modeller-10.4.tar.gz
#Then, we extract the downloaded folder containing MODELLER 10.4
!tar -zxf modeller-10.4.tar.gz
!echo "MODELLER extraction completed"
#Then, we switch onto the MODELLER folder
#with an automagic command
%cd modeller-10.4
#And we prepare a file containing the minimal setup elements
#For installing, including a license key
with open('modeller_config', 'w') as f:
  f.write("2\n")
  f.write("/content/compiled/MODELLER\n")
#ADD YOUR LICENSE KEY HERE!
  f.write("MODELIRANJE\n")
!./Install < modeller_config
!echo "MODELLER set up completed"
%cd /content/

--2024-10-02 22:54:38--  https://salilab.org/modeller/10.4/modeller-10.4.tar.gz
Resolving salilab.org (salilab.org)... 169.230.79.19
Connecting to salilab.org (salilab.org)|169.230.79.19|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 38244158 (36M) [application/x-gzip]
Saving to: ‘modeller-10.4.tar.gz’


2024-10-02 22:54:40 (25.2 MB/s) - ‘modeller-10.4.tar.gz’ saved [38244158/38244158]

MODELLER extraction completed
/content/modeller-10.4
Installation of MODELLER 10.4

This script will install MODELLER 10.4 into a specified directory
for which you have read/write permissions.

To accept the default answers indicated in [...], press <Enter> only.

------------------------------------------------------------------------

The currently supported architectures are as follows:

   1) Linux x86 PC (e.g. RedHat, SuSe).
   2) x86_64 (Opteron/EM64T) box (Linux).
   3) Alternative x86 Linux binary (e.g. for FreeBSD).
   4) Linux on 32-bit ARM (e.g. for Raspberry

In [2]:
#Creating a symbolic link
%cd modeller-10.4
!ln -sf /content/compiled/MODELLER/bin/mod10.4 /usr/bin/
%cd /content/
#Checking if MODELLER works
!mod10.4 | awk 'NR==1{if($1=="usage:") print "MODELLER successfully installed"; else print "Something went wrong. Please install again"}'

/content/modeller-10.4
/content
MODELLER successfully installed


2. Then, we will install biopython

In [3]:
#Installing biopython using pip
!pip install biopython

Collecting biopython
  Downloading biopython-1.84-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Downloading biopython-1.84-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.2 MB)
Installing collected packages: biopython
Successfully installed biopython-1.84


3. Finally, we will install py3Dmol

In [4]:
#Installing py3Dmol using pip
!pip install py3Dmol
#And importing the py3Dmol module
import py3Dmol

Collecting py3Dmol
  Downloading py3Dmol-2.4.0-py2.py3-none-any.whl.metadata (1.9 kB)
Downloading py3Dmol-2.4.0-py2.py3-none-any.whl (7.0 kB)
Installing collected packages: py3Dmol
Successfully installed py3Dmol-2.4.0


#Part I – Retrieving the amino acid sequence of PETase homologs

Typically, this step starts with a **BLAST search** (or other types of database search) for sequence homologs of a given protein of interest, taking into consideration the E-values, sequence coverages and identities.

Here, we skip the BLAST search that we performed during our previous tutorial and just download the sequences directly.

1. Download the sequence with accession ID BAB86909.1 using biopython as we did in previous tutorials:

In [5]:
import os
from pathlib import Path
from Bio import SeqIO, Entrez
#Setting up your email to be able to use Entrez (only needed once)
Entrez.email = 'your.email@uc.cl'
seqlist = ['6ANE_A', 'BAB86909.1']
for n in seqlist:
  #Creating a folder named after the accession ID of our sequence
  folder = Path(n)
  folder.mkdir(exist_ok=True)
  #Here, we set up a temporary handle with our downloaded sequence in fasta format
  temp = Entrez.efetch(db="protein",rettype="fasta",id=n)
  #Reading the sequence information as a SeqRecord object
  aaseq = SeqIO.read(temp, format="fasta")
  #Closing the temp handle
  temp.close()
  #Writing the sequence record in fasta format
  #(the file is closed automatically when leaving the with block)
  with open(folder/"target.fasta",'w') as aaseq_out:
    SeqIO.write(aaseq,aaseq_out,"fasta")

2. If everything worked smoothly, each sequence should have been downloaded as a FASTA file in a folder named after its accession ID. Can you check the description and sequence stored in this file?

**💡 HINT:** You will only retrieve the information for the last query in the list, as you are overwriting *aaseq*

In [6]:
#Obtain the description of the downloaded protein
print(aaseq.description)
#Obtain the sequence of the downloaded protein
print(aaseq.seq)

BAB86909.1 PBS(A) depolymerase [Acidovorax delafieldii]
MHLPRSRWDIPFKEETTMTHHFSVRALLAAGALLASAAVSAQTNPYERGPAPTTSSLEASRGPFSYQSFTVSRPSGYRAGTVYYPTNAGGPVGAIAIVPGFTARQSSINWWGPRLASHGFVVITIDTNSTLDQPDSRSRQQMAALSQVATLSRTSSSPIYNKVDTSRLGVMGWSMGGGGSLISARNNPSIKAAAPQAPWSASKNFSSLTVPTLIIACENDTIAPVNQHADTFYDSMSRNPREFLEINNGSHSCANSGNSNQALLGKKGVAWMKRFMDNDRRYTSFACSNPNSYNVSDFRVAACN


3. It is highly recommended at this point that you change the name of the sequence (labeled as “>”) to something shorter, such as just the ID or a name (e.g. target). You can simply do this manually by opening this file on Google Colab. **For this tutorial, we are indeed changing the name to target**
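
If you prefer to rename the sequence programmatically rather than by hand, the following is a minimal sketch using plain Python string handling. The helper name `rename_fasta_header` is hypothetical, and the example record is shortened for illustration.

```python
def rename_fasta_header(fasta_text, new_name):
    """Return FASTA text whose first header line is replaced by '>new_name'."""
    lines = fasta_text.splitlines()
    if not lines or not lines[0].startswith('>'):
        raise ValueError("input does not look like a FASTA record")
    lines[0] = '>' + new_name
    return '\n'.join(lines) + '\n'

# Toy record: the real header, followed by a shortened sequence
example = ">BAB86909.1 PBS(A) depolymerase [Acidovorax delafieldii]\nMHLPRSRWDIPFKEETTMTHH\n"
renamed = rename_fasta_header(example, "target")
print(renamed)
```

To apply this to the downloaded file, one could read `target.fasta` into a string, pass it through the function and write the result back to the same file.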

#Part II – Select an appropriate template structure and perform a sequence alignment for protein structure modeling

Selecting an appropriate template for modeling a structure of a homologous protein is as crucial as an appropriate alignment to correctly position the different residues.

**QUESTION:** What features of a crystal structure do you think are important for choosing the best template?

1. As we will work on the sequence BAB86909.1, we will first change directory.

In [7]:
%cd BAB86909.1

/content/BAB86909.1


2. We will find the potential best templates from the whole PDB database to model the structure of our target protein using the **profile.build()** command from MODELLER.

  For this purpose, we need a text file containing a list of non-redundant PDB sequences at 95% sequence identity and an appropriate script for running MODELLER.

In [8]:
#Downloading pdb_95.pir
!wget https://salilab.org/modeller/downloads/pdb_95.pir.gz
!gunzip pdb_95.pir.gz
#Downloading the build_profile.py script from GitHub
!wget https://raw.githubusercontent.com/pb3lab/ibm3202/master/scripts/build_profile.py

--2024-10-02 22:56:19--  https://salilab.org/modeller/downloads/pdb_95.pir.gz
Resolving salilab.org (salilab.org)... 169.230.79.19
Connecting to salilab.org (salilab.org)|169.230.79.19|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 19280601 (18M) [application/x-gzip]
Saving to: ‘pdb_95.pir.gz’


2024-10-02 22:56:21 (14.0 MB/s) - ‘pdb_95.pir.gz’ saved [19280601/19280601]

--2024-10-02 22:56:22--  https://raw.githubusercontent.com/pb3lab/ibm3202/master/scripts/build_profile.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1321 (1.3K) [text/plain]
Saving to: ‘build_profile.py’


2024-10-02 22:56:22 (44.7 MB/s) - ‘build_profile.py’ saved [1321/1321]



In [9]:
#Running the build_profile script
!mod10.4 build_profile.py
#Printing only the list of potential templates
!sed -n '/HITS FOUND IN ITERATION:     1/,/Weight Matrix/p;/Weight Matrix/q' build_profile.log

'import site' failed; use -v for traceback
HITS FOUND IN ITERATION:     1


Dynamically allocated memory at     amaxprofile [B,KiB,MiB]:      1185241    1157.462     1.130
> 7eoaA                      1     448   27850     259     304   48.21     0.0           2   248    43   299     1   251
> 7ykqA                      1    1489   31850     259     304   53.75     0.0           3   248    44   299     1   253
> 8d1dA                      1    5220   55050     262     304   80.93     0.0           4   257    48   304     6   262
> 8gzdA                      1    6365   30950     259     304   53.36     0.0           5   247    44   299     2   254
> 7cwqA                      1    6836   45400     270     304   65.62     0.0           6   253    44   296     2   257
> 2fx5A                      1   12539    6550     258     304   28.24    0.28E-06       7   158    50   223    10   179
> 8etyA                      1   19431   36950     261     304   58.89     0.0           8   248    44

In this particular example, a BLOSUM62 similarity matrix is used to score the alignments between the target and the potential templates. Also, we are employing only one search iteration and the parameter max_aln_evalue is set to 0.01, indicating that only sequences with E-values smaller than or equal to 0.01 will be included in the final profile.

For simplicity, we just printed out the PDB table from the resulting log file generated during this analysis.

As you can see, several PDB files are indicated. The important columns to determine the best templates from this analysis are the fifth, sixth, seventh and eighth columns (counting the PDB code as the first column), which correspond to the sequence length of the PDB hits and of the target protein to be modelled, their sequence identity and E-value, respectively.
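
To rank the hits without reading the table by eye, the lines starting with `>` can be parsed with a few lines of Python. This is a sketch under the assumption that the column layout is the one described above; the function name `parse_hits` and the dictionary keys are hypothetical, and the three sample lines are copied from the output shown earlier.

```python
def parse_hits(log_lines):
    """Extract PDB code, lengths, identity and E-value from hit lines."""
    hits = []
    for line in log_lines:
        fields = line.split()
        # Hit lines start with '>' followed by at least eight data columns
        if len(fields) < 9 or fields[0] != '>':
            continue
        hits.append({
            'code': fields[1],
            'hit_length': int(fields[5]),
            'target_length': int(fields[6]),
            'identity': float(fields[7]),
            'evalue': float(fields[8]),
        })
    # Highest sequence identity first
    return sorted(hits, key=lambda h: h['identity'], reverse=True)

sample = [
    "> 7eoaA                      1     448   27850     259     304   48.21     0.0           2   248    43   299     1   251",
    "> 8d1dA                      1    5220   55050     262     304   80.93     0.0           4   257    48   304     6   262",
    "> 2fx5A                      1   12539    6550     258     304   28.24    0.28E-06       7   158    50   223    10   179",
]
hits = parse_hits(sample)
for h in hits:
    print(h['code'], h['identity'], h['evalue'])
```

With the full log, the same function could be fed the lines of `build_profile.log` instead of the three sample lines.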

**QUESTION:** From this analysis, which template would be better for modeling the structure of our target sequence?

2. We will choose five PDB structures based on the sequence identity and E-value and select the most appropriate template for our target sequence among them. For this, we will first download these structures using the _Bio.PDB_ module from biopython, and then use the alignment.compare_structures() command to assess the structural and sequence similarity between the possible templates through the **compare.py** script.

  Please take a few minutes to examine the content of this script, particularly i) how the different protein structures are included within the script; and ii) which chain is being used from each structure. This is important for cases where only one of many chains in the PDB corresponds to the protein. Also, please note that there are two alignment steps: first, a sequence alignment; second, a structural alignment.

In [10]:
#Downloading the PDB files using biopython
from Bio.PDB import PDBList
templates = ['6eqe', '7dzv', '7ec8', '7nei', '6sbn']
pdbl = PDBList()
for s in templates:
  pdbl.retrieve_pdb_file(s, pdir='.', file_format ="pdb", overwrite=True)
  os.rename("pdb"+s+".ent", s+".pdb")

Downloading PDB structure '6eqe'...
Downloading PDB structure '7dzv'...
Downloading PDB structure '7ec8'...
Downloading PDB structure '7nei'...
Downloading PDB structure '6sbn'...


In [11]:
#Downloading the compare.py script from GitHub
!wget https://raw.githubusercontent.com/pb3lab/ibm3202/master/scripts/compare.py
#Check this script and change the names of the PDB files if required

--2024-10-02 23:10:43--  https://raw.githubusercontent.com/pb3lab/ibm3202/master/scripts/compare.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 469 [text/plain]
Saving to: ‘compare.py’


2024-10-02 23:10:43 (15.8 MB/s) - ‘compare.py’ saved [469/469]



In [12]:
#Running the compare script
!mod10.4 compare.py
#Check the log file
!sed -ne '/Sequence identity comparison (ID_TABLE):/,$ p' compare.log

'import site' failed; use -v for traceback
Sequence identity comparison (ID_TABLE):

   Diagonal       ... number of residues;
   Upper triangle ... number of identical residues;
   Lower triangle ... % sequence identity, id/min(length).

         6eqeA @07dzvA @17ec8A @17neiA @16sbnA @1
6eqeA @0      265     203     137     123     127
7dzvA @1       77     268     137     128     129
7ec8A @1       52      52     265     111     155
7neiA @1       48      50      43     258     124
6sbnA @1       48      49      59      48     263


Weighted pair-group average clustering based on a distance matrix:


                                                               .--- 6eqeA @0.9    23.0000
                                                               |
             .----------------------------------------------------- 7dzvA @1.6    49.7500
             |
             |                .------------------------------------ 7ec8A @1.4    41.0000
             |                |
        

Several decisions can be made based on the results from this analysis, which were filtered in the previous code cell for simplicity.

The resulting comparison between structures shows three things. First, PDB 6eqe has the highest resolution (0.9 Å). Second, 7ec8, 6sbn and 7nei form a group of structures separate from 6eqe and 7dzv. Third, with the exception of the comparison between 6eqe and 7dzv, the selected enzymes differ considerably from each other in terms of their amino acid sequence, with the difference being on average 48% (the sequence difference percentages correspond to the numbers on the right-hand side of the graph).

Thus, considering the sequence identity between the target sequence and 6eqe, its high resolution and other crystallographic quality features (that you can check on the **[PDB website](https://www.rcsb.org/)**), this is the best candidate for single-template modeling.
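
The percentages in the lower triangle of the identity table follow the definition printed in the log, id/min(length). As a small illustration, the sketch below computes this value for a pair of aligned sequences; the function name `percent_identity` and the two toy sequences are hypothetical.

```python
def percent_identity(aln_a, aln_b):
    """Percent identity of two aligned sequences, normalized by the shorter one."""
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have the same length")
    # Count columns where both sequences carry the same residue (gaps excluded)
    identical = sum(1 for a, b in zip(aln_a, aln_b) if a == b and a != '-')
    # Ungapped lengths of both sequences
    len_a = len(aln_a.replace('-', ''))
    len_b = len(aln_b.replace('-', ''))
    return 100.0 * identical / min(len_a, len_b)

# Toy alignment: 6 identical residues, ungapped lengths 9 and 8
identity = percent_identity("ACDEFG-HIK", "ACDQFGWH--")
print(f"{identity:.1f}%")
```

The same arithmetic reproduces the table above: the 203 identical residues shared by 6eqe and 7dzv, divided by the shorter length of 265 residues, give the 77% reported for that pair.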

3. Now, we will **align the sequence of our template protein with the sequence of our target protein**, such that we can model the structure.

  How hard is it? Not at all! Just download the **align2D.py** script into your working folder, check the script to verify how the sequence of the target and the protein structure are called, and execute the script as we have done before.



In [13]:
#Downloading the align2D.py script from GitHub
!wget https://raw.githubusercontent.com/pb3lab/ibm3202/master/scripts/align2D.py
#Check this script and change the names of the PDB files if required

--2024-10-02 23:11:30--  https://raw.githubusercontent.com/pb3lab/ibm3202/master/scripts/align2D.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 452 [text/plain]
Saving to: ‘align2D.py’


2024-10-02 23:11:30 (5.62 MB/s) - ‘align2D.py’ saved [452/452]



In [17]:
#Running the align2D script
!mod10.4 align2D.py

'import site' failed; use -v for traceback


4. You will end up with two new files (aligned.ali and aligned.fasta) that contain the pairwise alignment of the target and template sequences. Load the FASTA file into [Alignment Viewer 2.0](https://fast.alignmentviewer.org/). You can also use our Colab-mounted MSA viewer below:

In [18]:
#@title Protein MSA Viewer in Google Colab
#The following code is modified from the wonderful viewer developed by Damien Farrell
#https://dmnfarrell.github.io/bioinformatics/bokeh-sequence-aligner

#Importing all modules first
import os, io, random
import string
import numpy as np

from Bio.Seq import Seq
from Bio.Align import MultipleSeqAlignment
from Bio import AlignIO, SeqIO

import panel as pn
import panel.widgets as pnw
pn.extension()

from bokeh.plotting import figure
from bokeh.models import ColumnDataSource, Plot, Grid, Range1d
from bokeh.models.glyphs import Text, Rect
from bokeh.layouts import gridplot

#Setting up the amino color code according to Zappo color scheme
def get_colors(seqs):
    #make colors for residues in the sequences
    text = [i for s in list(seqs) for i in s]
    #Use Zappo color scheme
    clrs =  {'K':'red',
             'R':'red',
             'H':'red',
             'D':'green',
             'E':'green',
             'Q':'blue',
             'N':'blue',
             'S':'blue',
             'T':'blue',
             'A':'blue',
             'I':'blue',
             'L':'blue',
             'M':'blue',
             'V':'blue',
             'F':'orange',
             'Y':'orange',
             'W':'orange',
             'C':'blue',
             'P':'yellow',
             'G':'orange',
             '-':'white'}
    #Unknown characters (e.g. X) are shown in gray instead of raising a KeyError
    colors = [clrs.get(i, 'gray') for i in text]
    return colors

#Setting up the MSA viewer
def view_alignment(aln, fontsize="9pt", plot_width=800):
    """Bokeh sequence alignment view"""

    #make sequence and id lists from the aln object
    seqs = [rec.seq for rec in (aln)]
    ids = [rec.id for rec in aln]
    text = [i for s in list(seqs) for i in s]
    colors = get_colors(seqs)
    N = len(seqs[0])
    S = len(seqs)
    width = .4

    x = np.arange(1,N+1)
    y = np.arange(0,S,1)
    #creates a 2D grid of coords from the 1D arrays
    xx, yy = np.meshgrid(x, y)
    #flattens the arrays
    gx = xx.ravel()
    gy = yy.flatten()
    #use recty for rect coords with an offset
    recty = gy+.5
    h= 1/S
    #now we can create the ColumnDataSource with all the arrays
    source = ColumnDataSource(dict(x=gx, y=gy, recty=recty, text=text, colors=colors))
    plot_height = len(seqs)*15+50
    x_range = Range1d(0,N+1, bounds='auto')
    if N>100:
        viewlen=100
    else:
        viewlen=N
    #view_range is for the close up view
    view_range = (0,viewlen)
    tools="xpan, xwheel_zoom, reset, save"

    #entire sequence view (no text, with zoom)
    p = figure(title=None, width= plot_width, height=50,
               x_range=x_range, y_range=(0,S), tools=tools,
               min_border=0, toolbar_location='below')
    rects = Rect(x="x", y="recty",  width=1, height=1, fill_color="colors",
                 line_color=None, fill_alpha=0.6)
    p.add_glyph(source, rects)
    p.yaxis.visible = False
    p.grid.visible = False

    #sequence text view with ability to scroll along x axis
    p1 = figure(title=None, width=plot_width, height=plot_height,
                x_range=view_range, y_range=ids, tools="xpan,reset",
                min_border=0, toolbar_location='below')#, lod_factor=1)
    glyph = Text(x="x", y="y", text="text", text_align='center',text_color="black",
                text_font="monospace",text_font_size=fontsize)
    rects = Rect(x="x", y="recty",  width=1, height=1, fill_color="colors",
                line_color=None, fill_alpha=0.4)
    p1.add_glyph(source, glyph)
    p1.add_glyph(source, rects)

    p1.grid.visible = False
    p1.xaxis.major_label_text_font_style = "bold"
    p1.yaxis.minor_tick_line_width = 0
    p1.yaxis.major_tick_line_width = 0

    p = gridplot([[p],[p1]], toolbar_location='below')
    return p

#Loading the viewer by indicating the MSA file and format to read
#@markdown Name of the MSA file (including the filetype)
MSAfile = 'aligned.fasta' #@param {type:"string"}
MSAformat = 'fasta' #@param {type:"string"}
aln = AlignIO.read(MSAfile,MSAformat)
p = view_alignment(aln, plot_width=900)
pn.pane.Bokeh(p)


**💡 HINT:** If the viewer is not displayed, run `!pip install jupyter_bokeh` and execute the cell again.


  You will see that something odd is happening: **a large segment of the N-terminal region has no equivalent residues in the template structure!**

<figure>
<center>
<img src='https://raw.githubusercontent.com/pb3lab/ibm3202/master/images/cm_03.png'/>
</center>
</figure>

  This exercise leads to a general recommendation: you should always check all possible information on your biological sequences and structures to identify if these conflicts have any biological explanation. **THIS IS FUNDAMENTAL**.

5. Go to the PDB website and check the annotation of the sequence of the structure that you are using as template (in this case, 6EQE) in **Protein Feature View**. What do you see?

  Based on your observations, decide whether you should truncate the sequence of your target before modeling. If so, edit your target sequence and repeat the alignment.
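
If you decide to truncate the target, the edit can also be done in Python instead of by hand. The sketch below is only an illustration: the helper name `truncate_n_terminus`, the toy sequence and the number of residues removed are all hypothetical, and the actual cut position must come from your own inspection of the alignment and of the annotations.

```python
def truncate_n_terminus(sequence, n_remove):
    """Return the sequence without its first n_remove residues."""
    if not 0 <= n_remove < len(sequence):
        raise ValueError("n_remove must be between 0 and len(sequence) - 1")
    return sequence[n_remove:]

# Toy sequence and cut position (hypothetical, NOT the answer for our target)
toy_sequence = "MKKLLAAAQTNPYERG"
n_remove = 8
trimmed = truncate_n_terminus(toy_sequence, n_remove)
# Printing the trimmed sequence as a FASTA record named 'target'
print(">target")
print(trimmed)
```

After saving the trimmed sequence to `target.fasta`, the alignment script has to be executed again so that the alignment files reflect the edited target.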


#Part III – Generate and visualize a comparative model using MODELLER

1. Once your target and template sequences are aligned, use the **model-single.py** script to finally obtain a structure of your target through comparative modeling. Again, read the script and check how the sequences and structures are called in MODELLER through these scripts. In this case, we are also performing this step in a separate folder.

  Please note that 1 model is not enough: the optimization of the atomic coordinates against MODELLER's energy function ends at a different solution for each model, so different models will have different energies. Generally, between 50 and 100 models are generated for sufficient evaluation.

**💡 HINT:** For our example, the generation of 50 models takes around 15 min on Google Colab, whereas 10 models are generated in about 3 min. You can edit the number of models to generate in the `model-single.py` script.

In [19]:
#Creating a new folder and copying the required files for MODELLER
%mkdir model-single
%cd model-single
%cp ../6eqe.pdb .
%cp ../aligned.ali .

/content/BAB86909.1/model-single


In [20]:
#Downloading the model-single.py script from GitHub
!wget https://raw.githubusercontent.com/pb3lab/ibm3202/master/scripts/model-single.py
#Check this script and change the names of the PDB files if required

--2024-10-02 23:13:22--  https://raw.githubusercontent.com/pb3lab/ibm3202/master/scripts/model-single.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.110.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 717 [text/plain]
Saving to: ‘model-single.py’


2024-10-02 23:13:23 (28.9 MB/s) - ‘model-single.py’ saved [717/717]



In [21]:
#Running the model-single script
!mod10.4 model-single.py

'import site' failed; use -v for traceback


2. The output from this process is a set of PDB files, each one of them corresponding to a comparative model of our target protein, that are numbered from 1 up to the total number of models requested during comparative modeling.

  Also, the **model-single.log** output has the total potential energy for each structure, according to MODELLER’s DOPE (discrete optimized protein energy) score. For simplicity, this script was modified to indicate the model with the best DOPE score. We will be working only with the model with the best score for the remainder of the session.
  
  As an example, our best model during preparation of this tutorial showed the following DOPE score:

```
Top model: target.B99990025.pdb (DOPE score -28735.180)
```
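
The selection of the top model can be sketched as follows. This is not the content of `model-single.py`; it is a minimal illustration that assumes a list of dictionaries with the keys `name`, `DOPE score` and `failure`, similar to the `outputs` list that MODELLER's `AutoModel` provides after building the models. The helper name `pick_best_model` and the three mock entries are invented for this example.

```python
def pick_best_model(outputs, key='DOPE score'):
    """Return the successfully built model with the lowest (best) score."""
    ok_models = [m for m in outputs if m.get('failure') is None]
    if not ok_models:
        raise ValueError("no model was built successfully")
    # Lower (more negative) DOPE scores indicate better models
    return min(ok_models, key=lambda m: m[key])

# Mock results (invented values, for illustration only)
outputs = [
    {'name': 'target.B99990001.pdb', 'DOPE score': -28012.5, 'failure': None},
    {'name': 'target.B99990002.pdb', 'DOPE score': -28540.1, 'failure': None},
    {'name': 'target.B99990003.pdb', 'DOPE score': -29000.0, 'failure': 'optimization failed'},
]
best = pick_best_model(outputs)
print("Top model: %s (DOPE score %.3f)" % (best['name'], best['DOPE score']))
```

Note that the third mock model is ignored despite its lower score, because models flagged as failed should not be ranked.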

3. Before we check the quality of our model, we will take a look at it on **py3Dmol**.

**💡 HINT:** We are creating a copy of our model and changing the chain id from A to B, in order to load both structures into py3Dmol

In [None]:
#Copying our best model with a new chain id
#Replace target.B99990004.pdb with the name of your own top model
!sed "s/ A / B /g" target.B99990004.pdb > bestmodel.pdb

#Setting up py3Dmol for visualization
view=py3Dmol.view()
#Loading template
view.addModel(open('6eqe.pdb', 'r').read(),'pdb')
#Loading best DOPE score model
view.addModel(open('bestmodel.pdb', 'r').read(),'pdb')
#Coloring the structures by chain id
view.setStyle({'cartoon': {'colorscheme':'chain'}})
view.zoomTo()
view.setBackgroundColor('white')
view.show()
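
As an alternative to the `sed` command used above, the chain id can be changed by position, since the PDB format stores it in column 22 of each ATOM, HETATM and TER record. The following is a minimal sketch; the function name `set_chain_id` and the two-line toy PDB text are hypothetical.

```python
def set_chain_id(pdb_text, new_chain):
    """Return PDB text with the chain id (column 22) replaced in coordinate records."""
    if len(new_chain) != 1:
        raise ValueError("a chain id must be a single character")
    out = []
    for line in pdb_text.splitlines():
        # Column 22 corresponds to index 21 in 0-based Python indexing
        if line.startswith(('ATOM', 'HETATM', 'TER')) and len(line) >= 22:
            line = line[:21] + new_chain + line[22:]
        out.append(line)
    return '\n'.join(out) + '\n'

# Toy PDB text with a single atom (coordinates invented)
toy_pdb = (
    "ATOM      1  N   MET A   1      11.104  13.207   2.100  1.00 20.00           N\n"
    "END\n"
)
relabeled = set_chain_id(toy_pdb, "B")
print(relabeled)
```

Editing by column avoids touching any other part of the file where the pattern ` A ` could happen to appear.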

4. Finally, to check the stereochemical quality of the model and its comparison to experimentally solved structures, we will use the [SAVES server](https://saves.mbi.ucla.edu), which employs several structure-based scoring strategies:

* **VERIFY3D** (i.e. compatibility of an atomic 3D model to its 1D sequence when compared to the energetics of good structures from the PDB).
* **ERRAT** (i.e. quality of non-bonded interactions of a region when compared to similar regions from highly refined structures).
* **PROCHECK** (stereochemical and geometrical quality of the model, via Ramachandran plots, sidechain rotamers, etc).

5. Download your best model, upload it to SAVES and wait for the results. Briefly:
- **Check the VERIFY3D results:** >80% of the residues should have an average score ≥ 0.2, whereas the score profile allows you to identify conflicting regions.
- **Check the Ramachandran plot:** Are there any residues outside the allowed regions? What types of residues are found within those regions? (Check it by clicking on each dot in the plot)
- **Check the errors in PROCHECK:** are the errors located within the loop regions?
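
The VERIFY3D criterion above is simple arithmetic, as the sketch below shows for a list of per-residue scores. The function name `fraction_passing` and the ten scores are invented for illustration; real scores have to be taken from the SAVES results.

```python
def fraction_passing(scores, threshold=0.2):
    """Percentage of residues whose score is at least the threshold."""
    if not scores:
        raise ValueError("no scores provided")
    passing = sum(1 for s in scores if s >= threshold)
    return 100.0 * passing / len(scores)

# Invented per-residue scores (for illustration only)
scores = [0.45, 0.31, 0.22, 0.52, 0.09, 0.27, 0.60, 0.33, 0.25, 0.41]
pct = fraction_passing(scores)
print(f"{pct:.1f}% of residues score >= 0.2")
print("PASS" if pct > 80 else "CHECK MODEL")
```

Residues that fall below the threshold are the ones worth locating in the score profile, since they point to the conflicting regions of the model.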


<figure>
<center>
<img src='https://raw.githubusercontent.com/pb3lab/ibm3202/master/images/cm_04.png'/>
</center>
</figure>

While we will discuss some of these results at the end of this tutorial, we highly encourage you to read [this article](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1007449), which contains more recommendations on comparative modelling of protein structures.

**This is the end of the fourth tutorial! Good Science!!**