In [None]:
# Do not re-run this once you have filtered the templates you want to use.
# !"./2.download_pdb.sh"

- Note that the code names in `build_profile.prf` (such as "5ef9A") are the PDB code ("5ef9") followed by the chain ID ("A").

- To exclude a template, set its status to "FAILED" in the `pdb_codes.txt` file.

- To select the most appropriate template for our query sequence among the similar structures, we use the `Alignment.compare_structures()` command to assess the structural and sequence similarity between the candidate templates.

Chosen pdb codes (this is done by hand for now):
```python
pdbs = [
    ['3l56', 'B', 'SUCCESS'],
    ['3kc6', 'A', 'SUCCESS'],
    ['6qpf', 'C', 'SUCCESS'],
    ['6euw', 'A', 'SUCCESS'],
    ['5fmm', 'A', 'SUCCESS']
]
```
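The naming convention and the manual filtering step described above can be sketched in a few lines. This is only an illustration, not part of the pipeline: `split_align_code` and `mark_failed` are hypothetical helper names, and the three-column `<pdb code> <chain> <status>` record format is an assumption inferred from the list above.

```python
def split_align_code(align_code):
    """Split an alignment code such as '5ef9A' into PDB code and chain ID.

    Assumes a 4-character PDB code followed by the chain ID.
    """
    return align_code[:4], align_code[4:]


def mark_failed(lines, pdb_code, chain):
    """Return a copy of the records with one template's status set to FAILED.

    Each record is assumed to look like '<pdb code> <chain> <status>'.
    """
    updated = []
    for line in lines:
        fields = line.split()
        if len(fields) >= 3 and fields[0] == pdb_code and fields[1] == chain:
            fields[2] = 'FAILED'
            line = ' '.join(fields)
        updated.append(line)
    return updated


records = ['3l56 B SUCCESS', '3kc6 A SUCCESS']
print(split_align_code('5ef9A'))
print(mark_failed(records, '3l56', 'B'))
```

Working on a list of lines rather than on the file itself keeps the sketch free of side effects; writing the result back to `pdb_codes.txt` is left out on purpose.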

In [5]:
from modeller import *
from pathlib import Path


pdb_dir = Path('../data/pdb')
out_dir = Path('../data/compare')

out_dir.mkdir(exist_ok=True)

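# Read one "<pdb code> <chain> <status>" record per line
with open(pdb_dir/'pdb_codes.txt', 'r') as f:
    pdbs = f.read().splitlines()

# Keep the first three fields, skipping blank lines and templates marked FAILED
pdbs = [pdb.split()[:3] for pdb in pdbs if pdb.strip()]
pdbs = [pdb for pdb in pdbs if pdb[2] != 'FAILED']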

print(len(pdbs))
pdbs


5


[['3l56', 'B', 'SUCCESS'],
 ['3kc6', 'A', 'SUCCESS'],
 ['6qpf', 'C', 'SUCCESS'],
 ['6euw', 'A', 'SUCCESS'],
 ['5fmm', 'A', 'SUCCESS']]

In [2]:
env = Environ()
aln = Alignment(env)

for (pdb, chain, _) in pdbs:
    pdb_file = str(pdb_dir/pdb)
    m = Model(env, file=pdb_file, model_segment=('FIRST:'+chain, 'LAST:'+chain))
    aln.append_model(m, atom_files=pdb_file, align_codes=pdb+chain)

# improve the alignment by calculating multiple sequence alignment
aln.malign()


                         MODELLER 10.6, 2024/10/17, r12888

     PROTEIN STRUCTURE MODELLING BY SATISFACTION OF SPATIAL RESTRAINTS


                     Copyright(c) 1989-2024 Andrej Sali
                            All Rights Reserved

                             Written by A. Sali
                               with help from
              B. Webb, M.S. Madhusudhan, M-Y. Shen, G.Q. Dong,
          M.A. Marti-Renom, N. Eswar, F. Alber, M. Topf, B. Oliva,
             A. Fiser, R. Sanchez, B. Yerkovich, A. Badretdinov,
                     F. Melo, J.P. Overington, E. Feyfant
                 University of California, San Francisco, USA
                    Rockefeller University, New York, USA
                      Harvard University, Cambridge, USA
                   Imperial Cancer Research Fund, London, UK
              Birkbeck College, University of London, London, UK


Kind, OS, HostName, Kernel, Processor: 4, Linux gpuless 6.8.0-49-generic x86_64
Date and time of compilation 

In [3]:
# do least-squares superposition of the 3D structures, using the multiple sequence alignment as its starting point
try:
    aln.malign3d()
except Exception as e:
    print(e)

# Sequence alignment of the structurally conserved regions
# [average distance and standard deviation are with respect
#  to the framework (i.e., average structure)]
#
#  N av ds st dv   3l56B  3kc6A  6qpfC  6euwA  5fmmA  
   1 0.587 0.357   Q 551  Q 551  Q 551  I 354  Q 551  
   2 0.725 0.504   W 552  W 552  W 552  R 355  W 552  
   3 0.585 0.255   I 554  I 554  I 554  V 356  I 554  
   4 1.002 0.724   W 557  W 557  W 557  H 357  W 557  
   5 0.965 0.491 * E 558  E 558  E 558  E 358  E 558  
   6 0.996 0.671   T 559  T 559  T 559  G 359  T 559  
   7 0.812 0.552   I 562  I 562  I 562  Y 360  I 562  
   8 0.678 0.350   W 564  W 564  W 564  E 361  W 564  
   9 0.731 0.419   P 568  P 568  P 568  E 362  P 568  
  10 0.542 0.351 * G 590  G 590  G 590  G 367  G 590  
  11 0.742 0.429   Q 591  Q 591  Q 591  A 370  Q 591  
  12 0.687 0.476   G 594  G 594  G 594  T 371  G 594  
  13 0.733 0.478   R 597  R 597  R 597  I 373  R 597  
  14 0.489 0.271 * L 599  L 599  L 599  L 384  L 599  
  15 1.

In [4]:
# Compare the structures according to the alignment constructed by malign3d.
# This does not make an alignment; it calculates the RMS and DRMS deviations
# between atomic positions and distances, differences between the mainchain
# and sidechain dihedral angles, percentage sequence identities, and several
# other measures.
aln.compare_structures()






COMPARISON OF SEVERAL 3D STRUCTURES:


  #  ALGNMT CODE      ATOM FILE                          

  1  3l56B
  2  3kc6A
  3  6qpfC
  4  6euwA
  5  5fmmA



Variability at a given position is calculated as: 
VAR = 1/Nij * sum_ij (feat_i - feat_j)
sum runs over all pairs of proteins with residues present.



>> Least-squares superposition (FIT)           :       T

   Atom types for superposition/RMS (FIT_ATOMS): CA
   Atom type for position average/variability (VARATOM): CA


   Position comparison (FIT_ATOMS): 

       Cutoff for RMS calculation:     3.5000

       Upper = RMS, Lower = numb equiv positions

        3l56B   3kc6A   6qpfC   6euwA   5fmmA   
3l56B      0.000   0.981   1.100   2.167   0.729
3kc6A        193   0.000   1.096   2.101   0.904
6qpfC        134     132   0.000   2.171   1.462
6euwA         28      28      23   0.000   2.239
5fmmA        133     132     190      23   0.000



   Distance comparison (FIT_ATOMS): 

       Cutoff for rms calculation:     3.5000
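
To make the "Upper = RMS, Lower = numb equiv positions" layout more concrete, here is a minimal sketch of an RMS deviation restricted to positions within a cutoff. It is a simplified reading of the output header, not MODELLER's actual implementation: `rms_with_cutoff` is a hypothetical helper, the coordinates are made up, and the structures are assumed to be superposed already.

```python
import math


def rms_with_cutoff(coords_a, coords_b, cutoff=3.5):
    """RMS deviation over aligned positions no farther apart than `cutoff` (Å).

    coords_a and coords_b are equal-length lists of (x, y, z) tuples for
    aligned CA atoms that are assumed to be superposed already.
    Returns (rms, number_of_equivalent_positions).
    """
    squared = []
    for a, b in zip(coords_a, coords_b):
        d2 = sum((p - q) ** 2 for p, q in zip(a, b))
        if math.sqrt(d2) <= cutoff:
            squared.append(d2)
    if not squared:
        return 0.0, 0
    return math.sqrt(sum(squared) / len(squared)), len(squared)


# Toy coordinates (made up for illustration, not taken from the PDB files)
a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
b = [(0.0, 1.0, 0.0), (1.0, 1.0, 0.0), (2.0, 5.0, 0.0)]
rms, n_equiv = rms_with_cutoff(a, b)
print(rms, n_equiv)
```

The third pair is 5 Å apart, so it is excluded from both the RMS and the count. The same effect shows up in the matrix above: pairs involving 6euwA have few equivalent positions (23 to 28), which makes their RMS values less representative.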


In [8]:
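# Write the pairwise sequence-identity matrix for the aligned templates to family.mat
aln.id_table(matrix_file=str(out_dir/'family.mat'))
# Cluster the templates from that matrix and print the dendrogram
env.dendrogram(matrix_file=str(out_dir/'family.mat'), cluster_cut=-1.0)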


Sequence identity comparison (ID_TABLE):

   Diagonal       ... number of residues;
   Upper triangle ... number of identical residues;
   Lower triangle ... % sequence identity, id/min(length).

         3l56B @23kc6A @26qpfC @36euwA @15fmmA @2
3l56B @2      205     187     128       2     130
3kc6A @2       94     198     127       2     129
6qpfC @3       62      64     727       2     150
6euwA @1        1       1       1     156       2
5fmmA @2       63      65      35       1     425


Weighted pair-group average clustering based on a distance matrix:


                                                               .--- 3l56B @2.3     6.0000
                                                               |
                                             .--------------------- 3kc6A @2.0    36.0000
                                             |
                                    .------------------------------ 5fmmA @2.4    51.0000
                                    |
        .---

From the (above) ID Table, we see:

- **6qpfC** has the most residues (727), indicating it might cover more of the target.
- **6euwA** has the fewest residues (156), suggesting it may represent a smaller fragment or region.
- **3l56B** vs. **3kc6A**: 94% sequence identity, indicating very high similarity.
- **3l56B** vs. **6qpfC**: 62%.
- **3kc6A** vs. **6qpfC**: 64%.
- **6euwA** has essentially no significant sequence similarity to the others (all 1%).
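
The percentages quoted above can be reproduced from the diagonal and the upper triangle of the ID table using the `id/min(length)` rule stated in its header. A minimal sketch, with the numbers copied from the table; `percent_identity` is a hypothetical helper name:

```python
# Diagonal of the ID table: number of residues per template
lengths = {'3l56B': 205, '3kc6A': 198, '6qpfC': 727, '6euwA': 156, '5fmmA': 425}

# Upper triangle of the ID table: number of identical residues per pair
identical = {
    ('3l56B', '3kc6A'): 187,
    ('3l56B', '6qpfC'): 128,
    ('3l56B', '6euwA'): 2,
    ('3l56B', '5fmmA'): 130,
    ('3kc6A', '6qpfC'): 127,
    ('3kc6A', '6euwA'): 2,
    ('3kc6A', '5fmmA'): 129,
    ('6qpfC', '6euwA'): 2,
    ('6qpfC', '5fmmA'): 150,
    ('6euwA', '5fmmA'): 2,
}


def percent_identity(n_identical, len_a, len_b):
    """Percent identity normalized by the shorter sequence: id/min(length)."""
    return 100.0 * n_identical / min(len_a, len_b)


percent = {
    pair: percent_identity(n, lengths[pair[0]], lengths[pair[1]])
    for pair, n in identical.items()
}
for (a, b), value in percent.items():
    print(f'{a} vs {b}: {value:.1f}%')
```

Because the normalization uses the shorter sequence, the 727-residue 6qpfC is not penalized for its length when compared with the 205-residue 3l56B.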

Looking at the Dendrogram:

- **3l56B** and **3kc6A** are the closest, forming the first branch (6.0 distance), confirming their similarity.
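- **5fmmA** joins next at distance 36.0 (the value printed beside each leaf is the height at which it joins the branch below it), showing moderate similarity to **3l56B** and **3kc6A** (63% and 65% identity).
- **6qpfC** joins after that at distance 51.0. It is more distant in sequence: 62-64% identity to **3l56B** and **3kc6A**, but only 35% to **5fmmA**.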
- **6euwA** is the most divergent and joins the tree last.
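
The join distances in the dendrogram can be cross-checked against the ID table with a small weighted pair-group average (WPGMA) sketch. The assumption here is that the clustering distance is 100 minus the percent identity, which is consistent with the 6.0 printed for the 94% pair; `wpgma` is a hypothetical helper, not a MODELLER function.

```python
from itertools import combinations

# Percent identities from the lower triangle of the ID table
percent_id = {
    ('3l56B', '3kc6A'): 94, ('3l56B', '6qpfC'): 62, ('3l56B', '6euwA'): 1,
    ('3l56B', '5fmmA'): 63, ('3kc6A', '6qpfC'): 64, ('3kc6A', '6euwA'): 1,
    ('3kc6A', '5fmmA'): 65, ('6qpfC', '6euwA'): 1, ('6qpfC', '5fmmA'): 35,
    ('6euwA', '5fmmA'): 1,
}


def wpgma(percent_id):
    """Return the merges [(cluster_a, cluster_b, distance), ...] in join order."""
    # Assumed distance measure: 100 minus percent identity
    dist = {frozenset(pair): 100.0 - pid for pair, pid in percent_id.items()}
    clusters = sorted({name for pair in percent_id for name in pair})
    merges = []
    while len(clusters) > 1:
        # Join the two closest clusters
        a, b = min(combinations(clusters, 2), key=lambda p: dist[frozenset(p)])
        merges.append((a, b, dist[frozenset((a, b))]))
        merged = f'({a},{b})'
        for other in clusters:
            if other not in (a, b):
                # WPGMA: plain average of the two merged clusters' distances
                dist[frozenset((merged, other))] = (
                    dist[frozenset((a, other))] + dist[frozenset((b, other))]
                ) / 2
        clusters = [c for c in clusters if c not in (a, b)] + [merged]
    return merges


merges = wpgma(percent_id)
for a, b, d in merges:
    print(f'{d:5.1f}  {a} + {b}')
```

Under these assumptions the four join distances come out as 6.0, 36.0, 51.0 and 99.0, in line with the values printed next to the dendrogram leaves.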


Template Selection:
- **3kc6A** over **3l56B** and **5fmmA**, as it has the best crystallographic resolution (2.0 Å versus 2.3 Å and 2.4 Å). **5fmmA** can still serve as a secondary template in regions where its structural features align well.
- **6qpfC** has the highest residue count (727) and could provide coverage for larger regions. It is more distant in sequence from the main cluster (62-64% identity to **3l56B** and **3kc6A**), so it is not ideal as a primary template, but it can complement the main one. So we choose this over **6euwA**, which shares only about 1% identity with the other templates.
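
The resolution-based choice among the closely related templates can be written down as a one-line rule. This is just a sketch of the reasoning above, since the selection in this notebook is made by hand; `best_resolved` is a hypothetical helper, and the resolutions are the values printed after the `@` in the dendrogram.

```python
# Resolutions in Å as printed next to the dendrogram leaves (after the "@")
resolution = {'3l56B': 2.3, '3kc6A': 2.0, '5fmmA': 2.4}


def best_resolved(candidates, resolution):
    """Pick the candidate with the smallest resolution value (lower is better)."""
    return min(candidates, key=lambda code: resolution[code])


print(best_resolved(['3l56B', '3kc6A', '5fmmA'], resolution))
```

Resolution is only one criterion; coverage of the target and sequence identity still have to be weighed separately, as discussed for 6qpfC.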


## Notes:
- `@2.4` means `2.4Å` (lower is better). This refers to the crystallographic resolution.

- Table Layout: Diagonal (e.g., 205, 198): The total number of residues in the sequence of the respective template (e.g., 3l56B, 3kc6A). Longer sequences can cover more of the target, though the practical alignment length matters more.

- Table Layout: Upper Triangle (e.g., 187, 128): The number of identical residues between the respective pair of templates (e.g., 187 between 3l56B and 3kc6A).
Higher is better, as it indicates a closer match in sequence.


- Table Layout: Lower Triangle (e.g., 94, 62): The percentage sequence identity, normalized by the length of the shorter of the two sequences being compared (id/min(length)).
Higher is better, as it reflects stronger evolutionary or structural similarity between the templates.
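
The three regions of the table can also be separated programmatically. The sketch below parses the numeric rows exactly as printed above; `split_id_table` is a hypothetical helper, and the assumption is that every row consists of a code, an `@resolution` field, and one integer per template.

```python
ID_TABLE = """\
3l56B @2      205     187     128       2     130
3kc6A @2       94     198     127       2     129
6qpfC @3       62      64     727       2     150
6euwA @1        1       1       1     156       2
5fmmA @2       63      65      35       1     425
"""


def split_id_table(text):
    """Split the printed ID table into lengths, identical counts and percentages."""
    codes, rows = [], []
    for line in text.strip().splitlines():
        fields = line.split()
        codes.append(fields[0])
        # Skip the code and the "@resolution" field, keep the numbers
        rows.append([int(value) for value in fields[2:]])
    n = len(codes)
    # Diagonal: number of residues per template
    lengths = {codes[i]: rows[i][i] for i in range(n)}
    # Upper triangle: number of identical residues per pair
    identical = {(codes[i], codes[j]): rows[i][j]
                 for i in range(n) for j in range(i + 1, n)}
    # Lower triangle: percent identity, keyed in the same order as `identical`
    percent = {(codes[j], codes[i]): rows[i][j]
               for i in range(n) for j in range(i)}
    return lengths, identical, percent


lengths, identical, percent = split_id_table(ID_TABLE)
print(lengths)
print(identical[('3l56B', '3kc6A')], percent[('3l56B', '3kc6A')])
```

Keying both triangles by the same `(row code, column code)` order makes it easy to look up the count and the percentage for a pair side by side.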

In [9]:
chosen_template = "3kc6A"