# Protein structure superposing



## 3) **Superposing with Python**

Gemmi is an efficient parser for the `mmCIF` file format. We will use it here to read structures that were downloaded from the PDBe and saved locally. Once the structures are loaded, we will explore different protein structure superposition tools.

In [2]:
import subprocess
import pathlib
import gemmi
import Bio

### 3.1) Loading and saving structure files

Structure files, typically in PDB or mmCIF format, can be read into Python scripts with the built-in `open()` function and the `read()` and `write()` methods of the file object it returns. However, dedicated tools make this process quicker and less error-prone. We will focus on the Gemmi library, as it is relatively fast and has good support for the modern mmCIF file format. Another set of parsers for structure files is included in the popular Biopython library, which tends to focus on files in the PDB format.


#### **Gemmi**

[Gemmi](https://gemmi.readthedocs.io/en/latest/) is a modern Python and C++ library containing many tools for performing common tasks in structural biology. It is well suited to handling mmCIF (PDBx/mmCIF) file formats, which are organised into blocks and loops of data.
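To make the "blocks and loops" layout concrete, the sketch below parses a tiny hand-written mmCIF fragment using only the standard library. The fragment and the helper name `read_simple_loop` are illustrative assumptions, not part of Gemmi. The naive whitespace splitting ignores quoted and multi-line values, which is the kind of detail a dedicated parser such as Gemmi handles for real files.

```python
# Hand-written fragment for illustration only (not taken from a real entry).
MMCIF_TEXT = """\
data_EXAMPLE
loop_
_atom_site.group_PDB
_atom_site.label_atom_id
_atom_site.Cartn_x
_atom_site.Cartn_y
_atom_site.Cartn_z
ATOM N  1.000 2.000 3.000
ATOM CA 1.500 2.500 3.500
HETATM O 9.000 8.000 7.000
"""


def read_simple_loop(text):
    """Naively parse the first loop_ in `text` into a list of dicts.

    Assumes one row per line and no quoted or multi-line values.
    """
    tags, rows = [], []
    in_loop = False
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if line == "loop_":
            in_loop = True                      # tag names follow
        elif in_loop and line.startswith("_"):
            tags.append(line)                   # one tag (column name) per line
        elif in_loop and tags:
            rows.append(dict(zip(tags, line.split())))  # one data row per line
    return rows


atoms = read_simple_loop(MMCIF_TEXT)
ca_atoms = [row for row in atoms if row["_atom_site.label_atom_id"] == "CA"]
print(ca_atoms[0]["_atom_site.Cartn_x"])
```

Each tag shares the category prefix `_atom_site.`, and every value is read as a string, so coordinates need converting with `float()` before any arithmetic.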

To load an mmCIF file:

In [3]:
from gemmi import cif

# Location of saved structure file
path_mmcifs = "examples_mmcif/"
path_6mka = str( path_mmcifs + "6mka.cif" )

# Load file into program
mmcif_6mka = cif.read_file( path_6mka ).sole_block()

To extract information from the parsed mmCIF file, we can use the `find()` method, passing the category prefix (for example `_atom_site.`) and a list of the items within that category we want to access. For a complete list of categories and their items, refer to the [file format documentation](https://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v50.dic/Categories/index.html) provided by the [wwPDB](http://www.wwpdb.org/).

As an introductory example, the code below extracts the record type (`ATOM` or `HETATM`), atom name, atom coordinates, occupancy and author-assigned chain ID of each atom.

In [4]:
import numpy as np
import pandas as pd

# Extract data
mmcif_table =  mmcif_6mka.find(
    "_atom_site.",
    [
        "group_PDB",
        "label_atom_id",
        "Cartn_x",
        "Cartn_y",
        "Cartn_z",
        "occupancy",
        "auth_asym_id"
    ],
    )

# Convert to Numpy array
# ... not strictly necessary, but convenient for later numerical work
mmcif_array = np.asarray(mmcif_table)

# Convert to Pandas DataFrame
# ... especially useful for data handling and selecting rows
mmcif_df = pd.DataFrame(
    mmcif_table,
    columns=[
        "group_PDB",
        "label_atom_id",
        "Cartn_x",
        "Cartn_y",
        "Cartn_z",
        "occupancy",
        "auth_asym_id"
        ]
    )

# Extract only the CA atoms
ca_atoms = mmcif_df.loc[mmcif_df["label_atom_id"] == "CA"]
print(ca_atoms)

     group_PDB label_atom_id Cartn_x  Cartn_y  Cartn_z occupancy auth_asym_id
1         ATOM            CA  82.685  -94.977  -15.597      1.00            A
8         ATOM            CA  80.654  -98.122  -14.961      1.00            A
13        ATOM            CA  80.000  -96.679  -11.488      1.00            A
18        ATOM            CA  79.238  -93.265  -13.028      1.00            A
25        ATOM            CA  76.595  -94.900  -15.227      1.00            A
...        ...           ...     ...      ...      ...       ...          ...
4618      ATOM            CA  11.986  -60.859  -39.142      1.00            A
4626      ATOM            CA  11.556  -57.766  -36.974      1.00            A
4635      ATOM            CA  15.055  -57.653  -35.462      1.00            A
4643      ATOM            CA  17.318  -59.045  -38.260      1.00            A
4655      ATOM            CA  16.831  -57.330  -41.613      1.00            A

[640 rows x 7 columns]


Gemmi can also load a structure file as a hierarchy of coordinate objects (structure, model, chain, residue, atom), which is especially useful for superposition. We will use this method of parsing for superposing structures, but the former approach can be used to extract any information stored in the mmCIF file.

In [46]:
# Load example structure into script
model_original = gemmi.read_structure('./examples_mmcif/1lzh_updated.cif')  # PDB format is also accepted
print(model_original[0])

# Work with the first model in the file
model = model_original[0]

# Assign static (fixed) chain
static = model['A'].get_polymer()
# Assign mobile chain
mobile = model['B'].get_polymer()
ptype = static.check_polymer_type()     # Useful when not already known

# Perform QCP superposition
superposed = gemmi.calculate_superposition(
    static,                 # Fixed polymer
    mobile,                 # Polymer to be moved onto the fixed one
    ptype,                  # The type of macromolecule
    gemmi.SupSelect.CaP     # Use only CA atoms (P atoms for nucleic acids)
    )

# RMSD from superposition
print("RMSD = ", superposed.rmsd)

# Rotation matrix and translation vector of the superposition transformation
superposed_matrix = superposed.transform.mat
superposed_vector = superposed.transform.vec

superposed_matrix = np.asarray(superposed_matrix)   # Convert to ndarray for readable printing
superposed_vector = np.asarray(superposed_vector)   # Convert to ndarray for readable printing

print(superposed_matrix)
print(superposed_vector)

# Compare the CA atom of the same residue in both chains (before transformation)
res_index = 50    # Zero-based position in the polymer, not the residue number
print(model['A'].get_polymer()[res_index].sole_atom('CA'))
print(model['B'].get_polymer()[res_index].sole_atom('CA'))

# Apply the superposition transformation to the mobile polymer
mobile.transform_pos_and_adp(superposed.transform)

# Save the superposition
model_original.write_pdb("example_qcp_superposition.pdb",
          seqres_records=True,
          ssbond_records=True,
          link_records=True,
          cispep_records=True,
          ter_records=True,
          numbered_ter=True,
          ter_ignores_type=True,
          use_linkr=True
          )

model_original.make_mmcif_document().write_file('example_qcp_superposition.cif')

<gemmi.Model 1 with 2 chain(s)>
RMSD =  0.005110694766999929
[[ 0.97570792 -0.20759995  0.06997372]
 [ 0.21560375  0.96658406 -0.13867325]
 [-0.03884692  0.15039119  0.98786305]]
<gemmi.Vec3(-14.1959, 0.730062, -30.523)>
<gemmi.Atom CA at (6.8, 5.5, -16.4)>
<gemmi.Atom CA at (21.0, 2.4, 14.8)>

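The printed rotation matrix and translation vector fully describe the superposition: each mobile coordinate `x` is moved to `R·x + t`. As a minimal sketch using only the standard library, the code below applies the printed transformation to the chain B CA atom shown above and compares it with the chain A CA atom. The helper names `apply_transform` and `rmsd` are illustrative, and the input coordinates are the rounded values from the printed output, so the result is only approximate.

```python
import math

# Rotation matrix and translation vector copied from the printed output above.
ROTATION = [
    [0.97570792, -0.20759995, 0.06997372],
    [0.21560375, 0.96658406, -0.13867325],
    [-0.03884692, 0.15039119, 0.98786305],
]
TRANSLATION = [-14.1959, 0.730062, -30.523]

# CA coordinates as printed (rounded to one decimal place).
ca_static = [6.8, 5.5, -16.4]   # chain A
ca_mobile = [21.0, 2.4, 14.8]   # chain B, before superposition


def apply_transform(rotation, translation, point):
    """Return rotation @ point + translation for a single 3D point."""
    return [
        sum(r * p for r, p in zip(row, point)) + t
        for row, t in zip(rotation, translation)
    ]


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equal-length lists of 3D points."""
    squared = [math.dist(a, b) ** 2 for a, b in zip(coords_a, coords_b)]
    return math.sqrt(sum(squared) / len(squared))


ca_moved = apply_transform(ROTATION, TRANSLATION, ca_mobile)
print([round(value, 2) for value in ca_moved])
print(rmsd([ca_static], [ca_moved]))
```

The moved chain B atom lands within 0.1 Å of the chain A atom on every axis, which is the precision of the printed coordinates. The RMSD reported by Gemmi is computed over all selected CA atoms at full precision, so it will not match a value derived from this single rounded pair.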

### 3.2) Command line tools via Python

Some tools are only available as command line applications, meaning they lack a dedicated API for Python (or any other programming language). This prevents us from importing and calling them directly in our scripts. However, we can still run these applications from Python using the `subprocess` module. You will sometimes see the `os` module used for similar purposes; which you use is often a matter of preference. We will focus on `subprocess`, as it offers more features than `os`.

`subprocess` allows us to send commands to the terminal through Python. Take a look at the general example below:



In [2]:
import subprocess

# Define the command to run in the terminal
path = '.'
command = f"ls -ltr {path}"       # ls -ltr .

# Run the command through the shell and decode its output as text
args = {
    "shell" : True, 
    "encoding" : "utf-8"
    }

# Send the command to the terminal for execution
results = subprocess.check_output(command, **args).splitlines()

# Display the results, line-by-line, from the command
for line in results:
    print(line)

total 17280
-rw-r--r--  1 jellaway  staff   405166 May  5 15:12 sup.out
-rw-r--r--  1 jellaway  staff   405166 May  5 15:19 superpose_example_output.pdb
-rw-r--r--  1 jellaway  staff   765940 May  5 16:01 gesamt_example_output.pdb
-rw-r--r--  1 jellaway  staff  3628476 May 14 14:55 1joy.pdb
drwxr-xr-x  8 jellaway  staff      256 May 14 14:55 figures
-rw-r--r--  1 jellaway  staff     5739 May 14 17:09 superposition_1_web.ipynb
-rw-r--r--  1 jellaway  staff  3603199 Jun  7 09:09 1joy_aligned.pdb
drwxr-xr-x  8 jellaway  staff      256 Jun  7 09:09 examples_mmcif
drwxr-xr-x  9 jellaway  staff      288 Jun  7 09:09 examples_pdb
-rw-r--r--  1 jellaway  staff    11616 Jun  7 09:55 superposition_2_local.ipynb
-rw-r--r--  1 jellaway  staff    14846 Jun  7 10:29 superposition_3_python.ipynb


Providing you are in the same directory in which this notebook is running, executing the command `ls -ltr .` will give you exactly the same result. 
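A common alternative to `check_output` with `shell=True` is `subprocess.run` with the command given as a list of arguments, which avoids shell quoting issues and exposes the exit status alongside the output. The sketch below uses the Python interpreter itself (`sys.executable`) as a stand-in command so that it runs on any platform; the helper name `run_command` is illustrative.

```python
import subprocess
import sys


def run_command(arguments):
    """Run a command given as a list of strings; return (exit_code, stdout_lines)."""
    completed = subprocess.run(
        arguments,
        capture_output=True,   # collect stdout and stderr instead of printing them
        text=True,             # decode bytes to str
        check=False,           # do not raise on a non-zero exit status
    )
    return completed.returncode, completed.stdout.splitlines()


exit_code, lines = run_command(
    [sys.executable, "-c", "print('hello from a subprocess')"]
)
print(exit_code, lines)
```

An exit code of `0` conventionally signals success, so checking it is a simple way to detect a failed run before reading the output.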

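To run a command in the terminal without capturing its output, use `os.system()`. The output is printed directly, and the function returns the command's exit status rather than its output: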

In [8]:
import os

# Define the command to run in the terminal
path = '.'
command = f"ls -ltr {path}"       # ls -ltr .

# Execute the command in the terminal. The output is printed but not captured;
# the return value encodes the exit status (0 on success)
exit_status = os.system(command)

total 17296
-rw-r--r--  1 jellaway  staff   405166 May  5 15:12 sup.out
-rw-r--r--  1 jellaway  staff   405166 May  5 15:19 superpose_example_output.pdb
-rw-r--r--  1 jellaway  staff   765940 May  5 16:01 gesamt_example_output.pdb
-rw-r--r--  1 jellaway  staff  3628476 May 14 14:55 1joy.pdb
drwxr-xr-x  8 jellaway  staff      256 May 14 14:55 figures
-rw-r--r--  1 jellaway  staff     5739 May 14 17:09 superposition_1_web.ipynb
-rw-r--r--  1 jellaway  staff  3603199 Jun  7 09:09 1joy_aligned.pdb
drwxr-xr-x  8 jellaway  staff      256 Jun  7 09:09 examples_mmcif
drwxr-xr-x  9 jellaway  staff      288 Jun  7 09:09 examples_pdb
-rw-r--r--  1 jellaway  staff    11616 Jun  7 09:55 superposition_2_local.ipynb
-rw-r--r--  1 jellaway  staff    24476 Jun  7 10:04 superposition_3_python.ipynb


_NB_: `ls` lists the files and subdirectories in a specified folder, `-ltr` are flags that modify the command's behaviour, and `.` tells the command to work on the _current_ directory. `-ltr` is equivalent to `-l -t -r`, which tell `ls` to provide extended file/directory information, order the results by date modified, and reverse this order to place the most recent files at the bottom, respectively.

We can, therefore, use `subprocess` in Python scripts to run superposition applications. Of course, this module is not exclusive to superposition and can be used to run any command line application automatically.

The code below will serially superpose the structures in the `examples_mmcif` directory, using SSM and GESAMT.  

In [4]:
import subprocess

# Define the commands for running superposition applications
ssm = "superpose"
gesamt = "gesamt"

# Strings that will not change in script
path_structures = "./examples_mmcif/"
static_struct = "6mka.cif"
path_static = f"{path_structures}{static_struct}"

# Loop over the structures for superposition
for mobile_struct in ["6mkf.cif", "6mkg.cif", "6mkj.cif"]:

    # Path to mobile structure in superposition
    path_mobile = f"{path_structures}{mobile_struct}"

    # Out file names
    save_file_ssm = f"ssm_{mobile_struct[:4]}_to_{static_struct[:4]}.pdb"
    save_file_gesamt = f"gesamt_{mobile_struct[:4]}_to_{static_struct[:4]}.pdb"

    # SSM command ready
    command_ssm = f"{ssm} {path_static} {path_mobile} -o {save_file_ssm}"
    # GESAMT command ready
    command_gesamt = f"{gesamt} {path_static} {path_mobile} -o {save_file_gesamt}"
    print(command_ssm)

    # Execute SSM command
    subprocess.run(command_ssm, shell=True)
    # Execute GESAMT command
    subprocess.run(command_gesamt, shell=True)

superpose ./examples_mmcif/6mka.cif ./examples_mmcif/6mkf.cif -o ssm_6mkf_to_6mka.pdb
superpose ./examples_mmcif/6mka.cif ./examples_mmcif/6mkg.cif -o ssm_6mkg_to_6mka.pdb
superpose ./examples_mmcif/6mka.cif ./examples_mmcif/6mkj.cif -o ssm_6mkj_to_6mka.pdb


sh: superpose: command not found
sh: superpose: command not found
sh: superpose: command not found


If you receive a `command not found` error message, instead run the `ssm_gesamt_via_python.py` script, which contains exactly the same code. 
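A `command not found` message means the shell could not find the executable on the `PATH`. As a minimal sketch, the standard-library function `shutil.which` can check this before any superposition is attempted. The helper name `find_missing_tools` is illustrative, and the tool names are the ones used in the cell above.

```python
import shutil


def find_missing_tools(tool_names):
    """Return the names in `tool_names` that cannot be found on the PATH."""
    return [name for name in tool_names if shutil.which(name) is None]


missing = find_missing_tools(["superpose", "gesamt"])
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("Both tools are available.")
```

Running this check first makes it clear whether a failure comes from a missing installation or from the superposition itself.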

### **Practice exercise 3:** Serial superposition

Using the results collected from your initial search of 5igh against one or more of your chosen databases from exercise 1, write a Python script to superpose the models using either GESAMT or SSM. It might be useful to divide the work across the group, allowing you to compare the SSM and GESAMT results to see if they agree.
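As a possible starting point, the sketch below uses `pathlib` to build one GESAMT command per structure file in a directory, following the same `gesamt static mobile -o output` form used earlier. The function name `build_gesamt_commands` and the `hits/` directory layout are hypothetical; nothing is executed, so each command can be inspected before being passed to `subprocess`.

```python
import pathlib


def build_gesamt_commands(static_path, mobile_dir, out_dir="."):
    """Build one GESAMT argument list per .cif file in `mobile_dir`.

    The static structure itself is skipped. Nothing is executed here.
    """
    static_path = pathlib.Path(static_path)
    commands = []
    for mobile_path in sorted(pathlib.Path(mobile_dir).glob("*.cif")):
        if mobile_path.name == static_path.name:
            continue  # do not superpose the static structure onto itself
        out_name = f"gesamt_{mobile_path.stem}_to_{static_path.stem}.pdb"
        out_file = pathlib.Path(out_dir) / out_name
        commands.append(
            ["gesamt", str(static_path), str(mobile_path), "-o", str(out_file)]
        )
    return commands


# Hypothetical layout: search hits saved in ./hits/ alongside the query 5igh.cif
for command in build_gesamt_commands("hits/5igh.cif", "hits"):
    print(" ".join(command))
```

Keeping command construction separate from execution makes it easy to swap `gesamt` for `superpose`, or to split the list of commands between group members.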

Once superposed, you can load your results into PyMOL or Mol* and begin to identify common ligand-binding features between structures.