# Protein structure superposing



## **Web-based tools**

### 1.1) PDBeFold

EMBL-EBI's [PDBeFold server](https://www.ebi.ac.uk/msd-srv/ssm/cgi-bin/ssmserver) lets us search the entire [PDBe archive](https://www.ebi.ac.uk/pdbe/node/1) for structures similar to a given query. We can also modify our query to search the [CATH database](http://www.cathdb.info/search/by_sequence) instead. The output resembles the results returned by a [BLAST search](https://www.uniprot.org/blast/). In fact, you might find it useful to combine the results from PDBeFold with data from (PSI-)BLAST searches, domain predictions from [InterPro](https://www.ebi.ac.uk/interpro/) and functional annotations collated by the [SIFTS](https://www.ebi.ac.uk/pdbe/docs/sifts/index.html) project. However, the search tool uses the less accurate [SSM algorithm](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2692023/) to identify similar protein structures. Section 1.2) introduces the newer [Aggregate Views of Proteins](https://www.ebi.ac.uk/pdbe/pdbe-kb/protein), which uses the more accurate [GESAMT algorithm](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5117261/) for superposing protein structures.

### 1.2) Aggregate Views of Proteins

Integrated with the molecular graphics viewer [Mol*](https://molstar.org/), EMBL-EBI's [Aggregate Views of Proteins](https://www.ebi.ac.uk/pdbe/pdbe-kb/protein) page can be used to find proteins with structures similar to a query [UniProt accession](https://www.uniprot.org/) or PDB ID. The results are pre-calculated superpositions of protein structures deposited in the PDBe archive. Only structures matching the same segment of the UniProt sequence are considered for superposition, which is performed weekly to include newly submitted structures. The page superposes structures using the more accurate [GESAMT algorithm](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5117261/) and clusters them by conformation using a parameter called the reverse Q-score. You can find more details on the [page's wiki](https://github.com/PDBe-KB/pdbe-kb-manual/wiki/Superposition).

### 1.3) Foldseek

To search against the [AlphaFold database](https://alphafold.ebi.ac.uk/), [Foldseek](https://search.foldseek.com/search) provides users with a probability-ranked (by E-value) list of similar structures to the query. Foldseek also searches against the whole [PDBe archive](https://www.ebi.ac.uk/pdbe/node/1), if toggled. The listed results link to the hits' structures. However, a further superposition step is required to view hits in a molecular graphics viewer. We will look at this in the next step. 


#### **Practice exercise 1**: Searching for structures with similar folds

Use one of the provided protein structures to identify other structures with similar folds. You might want to inspect the results in your browser before downloading them to your local machine. Once you have some hits, query InterPro to explore the proteins' domain structure. Do your structure hits have similar domain structures? How does your structural superposition data inform your analysis of the query's domain structure?

----------------------------

## **Superposing structures locally**

Web-based services are extremely powerful, especially when searching against large databases such as the PDBe archive or AlphaFold Database. However, you might be working with newly solved protein structures that are not yet in one of these databases, or you might want to customise your superposition beyond the options these tools provide. This is where superposing structures on your own machine can be useful. Most superposition software runs on modern consumer hardware in a reasonable time, which makes exploring proteins on a laptop completely feasible. We will be looking at two powerful superposition algorithms used in structural biology packages such as PyMOL and Coot.

Running software from the command line provides several benefits over GUI-based applications. One advantage is the (sometimes significantly) lower computational overhead of command-line applications compared with those that have polished GUIs. This can make operations faster, especially on older hardware or where a visual interface is not immediately necessary. Secondly, once we know the parameters we need, we can execute the program without traversing menus, toggling optional fields and locating files in cumbersome file managers. Running software from the command line also allows us to quickly repeat executions, modify parameters and even incorporate them into our own scripts. We can also obtain a log from our execution of the program (a report of programmatic events, errors, warnings, values, runtime and more), which can be very useful when troubleshooting problems. Logs are not exclusive to command-line applications, but GUIs often omit this information for brevity.

Many tools for protein superposition exist as command-line tools. We will be using the structural biology tool suite CCP4 as it contains several excellent superposition algorithms. Many also exist as web servers, which can be used to query databases such as the PDBe or AlphaFoldDB, or as modules we can load into our own code. Let's take a closer look at CCP4!

### 2.1) Setup CCP4

The software suite CCP4 contains several command-line programs for superposing structures. CCP4 should be installed on your virtual machine, but can be downloaded from [their website](https://www.ccp4.ac.uk/). Once installed, run the command below to enable you to execute programs from CCP4 as terminal commands:

> `source /path/to/ccp4-8.0/bin/ccp4.setup-sh`

You will need to adjust the path to point to the location of your CCP4 installation. For users working on EMBL-EBI's virtual machine, this is:

> ` change me source /path/to/ccp4-8.0/bin/ccp4.setup-sh`

Now you will be able to run any program offered by the CCP4 suite! You can find the documentation for all programs CCP4 contains [here](https://www.ccp4.ac.uk/html/). We will be limiting our use of CCP4 to only the programs useful for protein superposition in this tutorial. Let us begin by comparing the two superposition algorithms: SSM and GESAMT. 
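
Before moving on, it can help to confirm that the setup script took effect. The snippet below is a minimal sketch, assuming that `superpose` and `gesamt` are the executable names placed on your `PATH` (as used in the commands later in this section) and that Python is started from the same terminal in which `ccp4.setup-sh` was sourced. The helper name `find_ccp4_programs` is our own and is not part of CCP4.

```python
import shutil


def find_ccp4_programs(names=("superpose", "gesamt")):
    """Map each program name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


# Report which of the superposition programs this Python session can see.
for name, path in find_ccp4_programs().items():
    print(f"{name}: {path or 'not found (has ccp4.setup-sh been sourced?)'}")
```

If a program is reported as not found, re-run the `source` command above in the current terminal and start Python again from that terminal.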

### 2.2) Superpose: SSM 

> `superpose ./examples_mmcif/6mka.cif ./examples_mmcif/6mkf.cif -o superpose_example_output.pdb`

This command will superpose the structure `6mkf` to `6mka`, saving the new version to the file `superpose_example_output.pdb`. Input files in either mmCIF or PDB format can be passed to `superpose`, although the program currently returns the structure in PDB format only.

### 2.3) Superpose: GESAMT

> `gesamt ./examples_mmcif/6mka.cif ./examples_mmcif/6mkf.cif -o gesamt_example_output.pdb`

This command will superpose the structure `6mkf` to `6mka`, saving the new version to the file `gesamt_example_output.pdb`. Input files in either mmCIF or PDB format can be passed to `gesamt`, although the program currently returns the structure in PDB format only.
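
Because both programs are called with the same argument pattern in this tutorial, the two commands can also be driven from Python. The following is a rough sketch rather than a definitive wrapper: `build_command` and `run_superposition` are hypothetical helper names, and the only command-line option used is `-o`, exactly as in the commands above.

```python
import shutil
import subprocess
from pathlib import Path


def build_command(program, first, second, output):
    """Return the argument list for a pairwise superposition run."""
    return [program, str(first), str(second), "-o", str(output)]


def run_superposition(program, first, second, output):
    """Run the program and return the text it printed.

    Returns None when the program is not on PATH, for example because
    the CCP4 setup script has not been sourced in this terminal.
    """
    if shutil.which(program) is None:
        return None
    result = subprocess.run(
        build_command(program, first, second, output),
        capture_output=True,
        text=True,
        check=True,  # raise an error if the program reports a failure
    )
    return result.stdout


# Show the two commands from sections 2.2 and 2.3 without running them.
examples = Path("examples_mmcif")
for program in ("superpose", "gesamt"):
    command = build_command(
        program,
        examples / "6mka.cif",
        examples / "6mkf.cif",
        f"{program}_example_output.pdb",
    )
    print(" ".join(command))
```

Calling `run_superposition("gesamt", "examples_mmcif/6mka.cif", "examples_mmcif/6mkf.cif", "gesamt_example_output.pdb")` would then execute the same command as in section 2.3 and hand back the terminal output as a string.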

### 2.4) Viewing the results

In addition to saving our superposition as a PDB file, the programs print debug information to the terminal. This includes information about the structural alignment, such as the RMSD, the Q-score and the rotation-translation matrix (or matrices), as well as the sequence identity and multiple-sequence alignment. We can capture this information by following our superposition command with the `>>` operator and the name of the file we want to send the information to. Note that `>>` appends to the file if it already exists, whereas `>` would overwrite it. For example,

> `gesamt test1.cif test2.cif -o test.pdb >> test.out`
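
The same kind of log capture can be sketched in Python. The helper below (`run_and_append_log`, a name chosen here for illustration) opens the log file in append mode, mirroring `>>`. One deliberate difference is that it also sends error messages (stderr) to the log, whereas the shell command above redirects standard output only. A harmless Python one-liner stands in for the `gesamt` command so that the example runs without CCP4.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_and_append_log(command, log_path):
    """Run command, append everything it prints to log_path, return its exit code."""
    with open(log_path, "a", encoding="utf-8") as log_file:
        completed = subprocess.run(
            command,
            stdout=log_file,
            stderr=subprocess.STDOUT,  # unlike `>>`, also keep error messages
        )
    return completed.returncode


# Stand-in command; replace it with e.g.
# ["gesamt", "test1.cif", "test2.cif", "-o", "test.pdb"]
demo_command = [sys.executable, "-c", "print('superposition log line')"]
demo_log = Path(tempfile.mkdtemp()) / "test.out"

run_and_append_log(demo_command, demo_log)
run_and_append_log(demo_command, demo_log)  # the second run is appended
print(demo_log.read_text(encoding="utf-8"))
```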

Once you have saved all the debug information you might need later, you can open the output PDB file in your favourite molecular graphics viewer. We suggest opening your results in [Mol* viewer](https://molstar.org/viewer/), a feature-rich online viewer that opens in your browser. Check whether SSM and GESAMT give the same result. Many molecular graphics viewers are packaged with SSM as their default structural alignment tool. GESAMT was built on SSM to address some of its limitations, without a prohibitive runtime penalty.
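
When comparing the two results, it helps to know what the reported scores mean. The sketch below computes the RMSD between two already-superposed, matched sets of coordinates, and a Q-score in the form we understand it to be defined for SSM: `N_align**2 / ((1 + (RMSD / R0)**2) * N1 * N2)`. The value `R0 = 3.0` Å is an assumption taken from our reading of the SSM literature, and the coordinates are made-up toy values, so treat the numbers as illustrative rather than as a reproduction of what either program prints.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between matched, already-superposed points."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared = sum(math.dist(a, b) ** 2 for a, b in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))


def q_score(n_aligned, rmsd_value, n_res_1, n_res_2, r0=3.0):
    """Q-score: rewards long alignments and penalises high RMSD (r0 is assumed)."""
    return n_aligned ** 2 / ((1 + (rmsd_value / r0) ** 2) * n_res_1 * n_res_2)


# Toy example: two points, each displaced by 1 Å along z.
toy_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
toy_b = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
print(rmsd(toy_a, toy_b))

# Identical structures, fully aligned, give the maximum score of 1.
print(q_score(n_aligned=100, rmsd_value=0.0, n_res_1=100, n_res_2=100))
```

A low RMSD alone can be misleading if only a few residues were aligned, which is why the Q-score also takes the alignment length and the sizes of both structures into account.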


----------------------


## Parsing structure files

Gemmi is an efficient parser for the `mmCIF` file format. We will be using it here to read structures from the PDBe that we have saved locally. Once they are loaded into the script, we will explore different protein structure superposition tools.

    import subprocess
    import pathlib
    import gemmi
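
As a first step towards working with locally saved structures, the sketch below uses `pathlib` to collect the mmCIF files in a directory and list every pair that could be superposed. It assumes the files live in `./examples_mmcif` and end in `.cif`, as in the commands above. The helper name `structure_pairs` is hypothetical, and the example deliberately stops short of parsing the files.

```python
import itertools
import pathlib


def structure_pairs(directory, pattern="*.cif"):
    """Return every unordered pair of structure files found in directory."""
    directory = pathlib.Path(directory)
    if not directory.is_dir():
        return []
    files = sorted(directory.glob(pattern))
    return list(itertools.combinations(files, 2))


# List the pairs available for superposition (prints nothing if the
# directory is missing or holds fewer than two matching files).
for first, second in structure_pairs("examples_mmcif"):
    print(f"{first.name} vs {second.name}")
```

Sorting the file names first keeps the order of the pairs stable between runs, which makes the resulting logs easier to compare.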