# IMP.pmi Tutorial Handout

### Integrating EM and Crosslinking data to localize five subunits of RNA Polymerase III

Authors: Riccardo Pellarin, Max Bonomi, Charles Greenberg, Daniel Saltzberg, Jan Kosinski, Ben Webb

 - Institut Pasteur, CNRS, C3BI
 - UCSF, Department of Bioengineering and Therapeutic Sciences

The Python Modeling Interface (PMI) is a powerful set of tools designed
to handle all steps of the modeling protocol for
typical modeling problems. It is designed to be used by writing a set of
Python scripts.

IMP.pmi has been used to determine the architecture of several macromolecular complexes, for instance:

[26S-PIP](https://salilab.org/26S-PIPs), [Yeast 40S-eIF3](https://salilab.org/40S-eIF1-eIF3), [Human Complement](https://salilab.org/Complement), [exosome](https://salilab.org/exosome),
    [yeast mediator](https://salilab.org/mediator/), [Nup84](https://salilab.org/nup84), [TFIIH](https://salilab.org/tfiih), [Nup82](https://salilab.org/nup82/), [SEA complex](https://salilab.org/sea), and the [Nuclear Pore Complex](https://salilab.org/npc2018)
    
Each repository above contains the scripts and the data, as well as all the results, that are needed to reproduce the published results. 

Integrative modeling studies are deposited in the [PDB-Dev database](https://pdb-dev.wwpdb.org/), which is run by wwPDB. These deposits may link to auxiliary files in other databases (e.g. EMDB, SASBDB) or by DOI (e.g. Zenodo).

We will illustrate the use of IMP.pmi by determining the localization of five
subunits of RNA Polymerase III. In the first part we will use chemical cross-linking coupled with
mass spectrometry (XLMS) combined with comparative models of all subunits. In the second part we will also include cryo-electron microscopy (cryoEM). We will try
to determine the architecture of the complex,
assuming that the architecture of the core is known, and aiming to localize C53, C37, C34, C82, and C31. The example can be easily generalized to any other set of subunits.

> The quoted text (grey lateral bar) refers to the EM modeling parts, and can be skipped if the modeling is performed using only crosslinks.


For more information on IMP.pmi, see [Saltzberg et al. 2018](https://salilab.org/pdf/Saltzberg_MethodsMolBiol_2019.pdf) or [Bonomi et al. 2018](https://salilab.org/pdf/Bonomi_Structure_2018.pdf).



## Installation

The current version of the tutorial is tested with IMP 2.17. This version can be installed on many platforms using [Anaconda](https://anaconda.org/salilab/imp), which provides all the dependencies.

To work through the example on your own system, you will need the following
packages installed in addition to [IMP itself](https://integrativemodeling.org/nightly/doc/manual/installation.html):

- [numpy and scipy](http://www.scipy.org/scipylib/download.html)
  for matrix and linear algebra

- [scikit-learn](http://scikit-learn.org/stable/install.html)
  for k-means clustering

- [matplotlib](http://matplotlib.org/downloads.html)
  for plotting results

- [Chimera](https://www.cgl.ucsf.edu/chimera/download.html)
  for visualization of results

On a Mac you can get them using the
[pip](https://pypi.python.org/pypi/pip) tool, e.g. by running a command like
`sudo easy_install pip`, then install the packages with something like
`sudo pip install scikit-learn; sudo pip install matplotlib`. `numpy` and `scipy` are already installed on modern Macs. Something
similar may also work on a Linux box, although it's probably better to install
the packages using the distribution's package manager, such as `yum` or
`apt-get`.
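
Before going further, it can be useful to confirm that the packages listed above are importable from the Python interpreter you plan to use. The following is a minimal sketch and is not part of the tutorial repository: the helper name `check_modules` is made up for illustration, and the module names are the usual import names of the packages (`sklearn` for scikit-learn).

```python
import importlib.util


def check_modules(names):
    """Return a dict mapping each top-level module name to True if it can be found."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


# Import names of IMP and of the Python packages listed above
status = check_modules(["IMP", "numpy", "scipy", "sklearn", "matplotlib"])
for name, found in status.items():
    print("%-12s %s" % (name, "found" if found else "MISSING"))
```

Chimera is a standalone application rather than a Python package, so it is not covered by this check.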

Then download the input files, either by 
[cloning the GitHub repository](https://github.com/Pellarin/imp_tutorial_pol3/tree/master)
or by [downloading the zip file](https://github.com/Pellarin/imp_tutorial_pol3/archive/master.zip).

## Colab Installation

To install IMP and related modules in a Google Colab notebook, run the following script:



In [None]:
!pip install -q condacolab
import condacolab
condacolab.install()
!conda install -c salilab imp
!conda install matplotlib
!git clone https://github.com/Pellarin/imp_tutorial_pol3.git
%cd imp_tutorial_pol3/rnapoliii/modeling

## Content of this repository

A short tutorial introduction to IMP is in the `doc` directory. The rnapoliii example scripts are contained in the directory `rnapoliii/modeling`. The advanced analysis is contained in `rnapoliii/analysis`, and the deposition tutorial is in `rnapoliii/deposition`.

## Table of Contents

[//]: # (To compile the Table of Content run `python tools/compile_toc.py Tutorial.ipynb` and paste the output here below)

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[ Background of RNA Polymerase III ](#3_Background_of_RNA_Polymerase_III)

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[ Integrative Modeling using IMP ](#4_Integrative_Modeling_using_IMP)

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[ The four stages of Integrative Modeling ](#3_The_four_stages_of_Integrative_Modeling)

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[ Running the script ](#3_Running_the_script)

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[ Stage 1 - Gathering of data ](#Stage_1_2)

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[ Data for yeast RNA Polymerase III ](#Data_rnapoliii_3)

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[ Stage 2 - Representation of subunits and translation of the data into spatial restraints ](#Stage_2_2)

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[ Setting up Model Representation and Degrees of Freedom in IMP ](#Setting_up_3)

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[ Hierarchy ](#Hierarchy_3)

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[ Dissecting the script ](#Dissecting_the_script_3)

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[ Model Representation Using a Topology File. ](#Topology_file_4)

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[ Building the System Representation and Degrees of Freedom ](#Representation_and_DOF_4)

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[ Scoring Function ](#Scoring_Function_3) 

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[ Connectivity Restraint ](#Connectivity_Restraint_4) 

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[ Excluded Volume Restraint ](#Excluded_Volume_Restraint_4) 

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[ Crosslinks - dataset 1 ](#Crosslink_1_4)

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[ Crosslinks - dataset 2 ](#Crosslink_2_4)

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[ Electron Microscopy Restraint ](#EM_4) 

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[ Stage 3 - Sampling ](#Sampling_2)

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[ Modeling Output ](#Output_3)

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[  Using `StatHierarchyHandler` for inline analysis ](#ProcessOutput_3)

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[ Stage 4 - Analysis ](#Analysis_3)

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[Clustering top models using `analysis.py` ](#Clustering_3)

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[ Structural Uncertainty of the solutions ](#uncertainty_3)

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[ Accuracy evaluation ](#Accuracy_3)

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[ Sampling Exhaustiveness ](#Sampling_Exhaustiveness_3)



## Background of RNA Polymerase III <a name="3_Background_of_RNA_Polymerase_III"></a>

[RNA Pol III](http://en.wikipedia.org/wiki/RNA_polymerase_III) is a eukaryotic complex that catalyzes DNA transcription into ribosomal 5S rRNA and tRNA.  Eukaryotic RNA polymerase III contains 17 subunits. The yeast RNA Pol III dissociates into core, stalk, heterodimer (C53/C37), and heterotrimer (C82/C34/C31).


### Integrative Modeling using IMP <a name="4_Integrative_Modeling_using_IMP"></a>

This example will use data from chemical cross-linking, EM and comparative models to localize the 5 subunits of the RNA Polymerase III heterodimer and heterotrimer relative to a fixed core of the remaining 12 subunits.  

<img src="images/rnapoliii_scheme.png" alt="Drawing" style="width: 600px;"/>

### The four stages of Integrative Modeling <a name="3_The_four_stages_of_Integrative_Modeling"></a>

Structural modeling using IMP is divided into [four stages](https://integrativemodeling.org/2.11.1/doc/manual/intro.html#procedure).

Click the links below to see a breakdown of all the modeling steps.


* [Stage 1](#Stage_1_2)) Collect biophysical data that can be used as structural restraints and constraints
  
* [Stage 2](#Stage_2_2)) Define representations for the RNA Pol III structural model and define each data point as a scoring function.

* [Stage 3](#Sampling_2)) Run a sampling protocol to find good scoring conformations.  

* [Stage 4](#Analysis_3)) Analysis of the good scoring conformations.  Clustering; uncertainty; precision; etc...

## Stage 1 - Gathering of data <a name="Stage_1_2"></a>

In this stage, we find all available experimental data that we wish to utilize in structural modeling.  In theory, any method that provides absolute or relative structural information can be used.

### Data for yeast RNA Polymerase III <a name="Data_rnapoliii_3"></a>
The `rnapoliii/data` folder in the tutorial input files contains the data included in this example:

* Sequence information (FASTA files for each subunit)
* [10 Angstrom Electron density map](https://www.ebi.ac.uk/pdbe/entry/emdb/EMD-1804) (`.mrc`, `.txt` files)
* [High resolution structure from comparative modeling against Pol II structure](http://www.rcsb.org/pdb/explore/explore.do?structureId=1WCM) (PDB files; see below)
* Chemical crosslinking datasets (we use two data sets, for the apo and DNA-bound complexes, both from [Ferber and Kosinski](https://www.ncbi.nlm.nih.gov/pubmed/27111507))


**FASTA File**  
Each residue included in modeling must be explicitly defined in the FASTA text file.  Each individual component (i.e., a protein chain) is identified by a string in the FASTA header line.

    >P20434
    MDQENERNISRLWRAFRTVKEMVKDRGYFITQEEVELPLEDFKAKYCDSMGRPQRKMMSF
    QANPTEESISKFPDMGSLWVEFCDEPSVGVKTMKTFVIHIQEKNFQTGIFVYQNNITPSA
    MKLVPSIPPATIETFNEAALVVNITHHELVPKHIRLSSDEKRELLKRYRLKESQLPRIQR
    ADPVALYLGLKRGEVVKIIRKSETSGRYASYRICM
    >P20435
    MSDYEEAFNDGNENFEDFDVEHFSDEETYEEKPQFKDGETTDANGKTIVTGGNGPEDFQQ
    HEQIRRKTLKEKAIPKDQRATTPYMTKYERARILGTRALQISMNAPVFVDLEGETDPLRI
    AMKELAEKKIPLVIRRYLPDGSFEDWSVEELIVDL

defines two chains with unique IDs of P20434 and P20435 respectively.  The entire complex is 17 chains and 6263 residues.
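
To make the FASTA layout concrete, here is a minimal sketch of a reader that maps each header ID to its sequence and reports the chain lengths. It is for illustration only: IMP.pmi has its own FASTA handling, the function name `read_fasta` is hypothetical, and the two short records are made-up fragments rather than the real subunit sequences.

```python
def read_fasta(text):
    """Return {header_id: sequence} for a FASTA-formatted string."""
    records = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The ID is the first word after the ">" character
            current = line[1:].split()[0]
            records[current] = []
        elif current is not None:
            # Sequence lines may be wrapped over several lines
            records[current].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


# Made-up example with two chains (hypothetical IDs and truncated sequences)
example = """>chainA
MDQENERNIS
RLWRAF
>chainB
MSDYEEAF
"""

records = read_fasta(example)
for name, seq in records.items():
    print(name, len(seq))
```

Applied to the real `sequences.fasta`, the same loop would list one entry per subunit, keyed by the IDs used in the `fasta_id` column of the topology file.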

> **Electron Density Map**  
The electron density map of the entire RNA Pol III complex is at 10 Angstrom resolution.  The raw data file for this is stored in `emd_1804_10A_2010.mrc`.

> <figure><img src="images/rnapoliii_cryoEM.png" width="300px" />
<figcaption>_Electron microscopy density map for yeast RNA Polymerase III at 10 Ang resolution (emd 1804)_</figcaption></figure>
    
> **Electron Density as Gaussian Mixture Models**  
Gaussian mixture models (GMMs) are used to greatly speed up scoring by approximating the electron density of individual subunits and experimental EM maps.  Several GMMs have been created for the experimental density map, with different numbers of Gaussians, and are stored in the `*_gmm.mrc` files.  The weight, center, and covariance matrix of each Gaussian used to approximate the original EM density can be seen in the corresponding `.txt` file.  See below for an explanation of how to generate GMMs and how to analyse them.

> <figure><img src="images/rnapoliii_gmm1.png" width="300px" />
> <figcaption>_The EM data represented as a 50 Gaussian mixture model_</figcaption></figure>

---

> <figure><img src="images/rnapoliii_gmm2.png" width="300px" />
> <figcaption>_The EM data represented as a 200 Gaussian mixture model_</figcaption></figure>

---

> <figure><img src="images/rnapoliii_gmm3.png" width="300px" />
> <figcaption>_The EM data represented as an 800 Gaussian mixture model_</figcaption></figure>

---

> <figure><img src="images/rnapoliii_gmm4.png" width="300px" />
> <figcaption>_The EM data represented as a 3200 Gaussian mixture model_</figcaption></figure>

---

> <figure><img src="images/rnapoliii_gmm5.png" width="300px" />
> <figcaption>_The EM data represented as a 12797 Gaussian mixture model_</figcaption></figure>


**PDB File**  
High resolution coordinates for all 17 chains of RNA Pol III are found in `.pdb` files.  

<figure><img src="images/rnapoliii_native.png" width="300px" />
<figcaption>_Native structure of Pol III [5FJA](http://www.rcsb.org/pdb/explore.do?structureId=5FJA)_</figcaption></figure>

---

<figure><img src="images/rnapoliii_core_homology.png" width="300px" />
<figcaption>_Homology model of the core of Pol III based on [4C3I](http://www.rcsb.org/pdb/explore.do?structureId=4C3I)_</figcaption></figure>

---

<figure><img src="images/rnapoliii_C82_homology.png" width="300px" />
<figcaption>_Homology model of C82 based on [2XUB](http://www.rcsb.org/pdb/explore.do?structureId=2XUB)_</figcaption></figure>

---

<figure><img src="images/rnapoliii_C37_C53_homology.png" width="300px" />
<figcaption>_Homology models of the C37 C53 heterodimer based on [4C3I](http://www.rcsb.org/pdb/explore.do?structureId=4C3I)_</figcaption></figure>

---

<figure><img src="images/rnapoliii_C34_homology.png" width="300px" />
<figcaption>_Homology model of three domains of C34 based on 
[2DK8](http://www.rcsb.org/pdb/explore.do?structureId=2DK8), 
[2DK5](http://www.rcsb.org/pdb/explore.do?structureId=2DK5), 
[1LDD](http://www.rcsb.org/pdb/explore.do?structureId=1LDD) </figcaption></figure>

**Chemical Cross-Links**  
All chemical cross-linking data is located in `FerberKosinski2016_apo.csv` and `FerberKosinski2016_DNA.csv`.  These files contain multiple comma-separated columns; four of these specify the protein and residue number for each of the two linker residues. The length of the DSS/BS3 cross-linker reagent, 21 angstroms, will be specified later in the modeling script.  
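
As an illustration of the four columns that identify a cross-link, the sketch below reads a small in-memory CSV table into a list of residue pairs. The column names (`Protein1`, `Residue1`, `Protein2`, `Residue2`), the protein names and the residue numbers are all hypothetical; check the header line of the actual data files for the names used there. In the tutorial itself the files are read by IMP.pmi, not by code like this.

```python
import csv
import io

# Made-up table; the column names are assumptions, not those of the real files
example_csv = """Protein1,Residue1,Protein2,Residue2
ProtA,12,ProtB,40
ProtA,55,ProtA,90
"""


def read_crosslinks(handle):
    """Return a list of ((protein, residue), (protein, residue)) pairs."""
    links = []
    for row in csv.DictReader(handle):
        first = (row["Protein1"], int(row["Residue1"]))
        second = (row["Protein2"], int(row["Residue2"]))
        links.append((first, second))
    return links


crosslinks = read_crosslinks(io.StringIO(example_csv))

# Separate links between two different proteins from links within one protein
inter = [link for link in crosslinks if link[0][0] != link[1][0]]
intra = [link for link in crosslinks if link[0][0] == link[1][0]]
print(len(inter), "inter-protein and", len(intra), "intra-protein cross-links")
```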

## Stage 2 - Representation of subunits and translation of the data into spatial restraints <a name="Stage_2_2"></a>


In this stage, we will initially define a representation of the system. Afterwards, we will convert the data into spatial restraints.  This is performed using the script `modeling/modeling.py`, which reads a
topology file (`topology_poliii.txt`, or `topology_poliii.cryoem.txt` when the cryoEM data is included) to define the system components and their representation
parameters.

### Setting up Model Representation and Degrees of Freedom in IMP <a name="Setting_up_3"></a>

Very generally, the *representation* of a system is defined by all the variables that need to be determined based on input information, including the assignment of the system components to geometric objects (e.g. points, spheres, ellipsoids, and 3D Gaussian density functions). 

Our RNA Pol III representation employs **spherical beads** of varying sizes and **3D Gaussians** (if the EM data is integrated), which coarsen domains of the complex using several resolution scales simultaneously. Below we show the representation for the Pol II case, but the same applies to Pol III.

<figure><img src="images/rnapolii_Multi-scale_representation.png" width="600px" />
<figcaption>_Multi-scale representation of Rpb1 subunit of RNA Pol II_</figcaption></figure>

The **spatial restraints** will be applied to individual resolution scales as appropriate. 

Beads and Gaussians of a given domain are arranged into either a rigid body or a flexible string, based on the crystallographic structures. 

The GMM of a subunit is the set of all 3D Gaussians used to represent it; it will be used to calculate the EM score. The calculation of the GMM of a subunit can be done automatically in the **topology file**.
For the purposes of this tutorial, we already created these for all subunits and placed them in the `rnapoliii/data` directory in their respective `.mrc` and `.txt` files (eg, `ABC14.5.0.txt`). 
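
To show what a set of 3D Gaussians encodes, the following sketch evaluates the density of a mixture at a point as the weighted sum of normalized Gaussians, each defined by a weight, a center and a covariance matrix. It is a toy illustration assuming `numpy` is installed: the function name `gmm_density` and the numerical values are made up, and the code neither reads the `.txt` files nor reproduces the EM score used by IMP.

```python
import numpy as np


def gmm_density(x, weights, centers, covariances):
    """Density at point x of a 3D Gaussian mixture (weighted sum of normal densities)."""
    x = np.asarray(x, dtype=float)
    total = 0.0
    for w, mu, cov in zip(weights, centers, covariances):
        cov = np.asarray(cov, dtype=float)
        diff = x - np.asarray(mu, dtype=float)
        # Normalization constant of a 3D normal distribution
        norm = np.sqrt((2.0 * np.pi) ** 3 * np.linalg.det(cov))
        # Mahalanobis term diff^T cov^-1 diff, computed without an explicit inverse
        exponent = -0.5 * diff @ np.linalg.solve(cov, diff)
        total += w * np.exp(exponent) / norm
    return total


# Made-up mixture of two Gaussians (coordinates in Angstroms)
weights = [0.7, 0.3]
centers = [[0.0, 0.0, 0.0], [10.0, 0.0, 0.0]]
covariances = [np.eye(3) * 4.0, np.diag([9.0, 4.0, 4.0])]

print(gmm_density([1.0, 0.0, 0.0], weights, centers, covariances))
```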

In a **rigid body**, all the beads and the Gaussians of a given domain have their relative distances constrained during configurational sampling, while in a **flexible string** the beads and the Gaussians are restrained by the sequence connectivity. 


<figure><img src="images/rnapolii_rb.png" width="300px" />
<figcaption>_Rigid Bodies and beads_</figcaption></figure>

**Super rigid bodies** are sets of rigid bodies and beads that will move together in an additional Monte Carlo move.

<figure><img src="images/rnapolii_srb.png" width="300px" />
<figcaption>_Super Rigid Bodies_</figcaption></figure>

**chain_of_super_rigid_bodies** are additional degrees of freedom along the connectivity chain of a subunit. They group sequence-connected rigid domains and/or beads into overlapping pairs and triplets. Each of these groups is moved rigidly. This mover helps to sample complex topologies, made of several rigid bodies connected by flexible linkers, more efficiently.

<figure><img src="images/rnapolii_cosrb.png" width="300px" />
<figcaption>_Chain of Super Rigid Bodies_</figcaption></figure>
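
The grouping into overlapping pairs and triplets can be sketched in a few lines of plain Python. This only illustrates the idea described above under the assumption that groups are contiguous windows of 2 and 3 elements along the sequence; the function name `overlapping_groups` and the domain labels are hypothetical, and the code is not the PMI implementation.

```python
def overlapping_groups(units, sizes=(2, 3)):
    """Return all windows of consecutive units for each requested window size."""
    groups = []
    for size in sizes:
        for start in range(len(units) - size + 1):
            groups.append(tuple(units[start:start + size]))
    return groups


# Hypothetical subunit made of three rigid domains joined by two flexible linkers
domains = ["rb1", "linker1", "rb2", "linker2", "rb3"]
for group in overlapping_groups(domains):
    print(group)
```

For five sequence-connected elements this gives four pairs and three triplets, each of which would be moved as one rigid unit in its own Monte Carlo move.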


### Hierarchy <a name="Hierarchy_3"></a>

A hierarchy in IMP is a tree that stores information on molecules, residues, atoms, etc., where the resolution of the representation increases as you move further from the root. IMP.pmi was designed to support a specialised multi-state/multi-copy/multi-resolution hierarchy.

<figure><img src="images/rnapolii_hierarchy.png" width="600px" />
<figcaption>_PMI hierarchy_</figcaption></figure>

The **States** are used as putative structural and compositional alternatives of the system. 

Each **State** contains the **Molecules**, and each Molecule can occur in different stoichiometric **Copies** (e.g. here MolA has three identical copies: MolA.0, MolA.1, and MolA.2). 

The **Molecules** contain structures (i.e., particles with coordinates, masses and radii) classified by several **resolutions**: Atomic (Resolution 0), Residues (Resolution 1), Fragments (Resolution > 1), and the Gaussians (Densities). 

All structures (except the densities) are represented by Spheres with appropriate radius and mass. The resolutions coexist simultaneously, therefore the same part of the molecule can be represented by several resolutions.
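
As a rough picture of how one stretch of sequence can be described at several resolutions at once, the sketch below splits a residue range into consecutive fragments of a given size. It assumes that a resolution of N means beads of up to N residues; the function name `fragment_ranges` is hypothetical and the exact fragment boundaries chosen by PMI may differ.

```python
def fragment_ranges(first, last, residues_per_bead):
    """Split residues first..last (inclusive) into consecutive fragments."""
    return [(start, min(start + residues_per_bead - 1, last))
            for start in range(first, last + 1, residues_per_bead)]


# The same 25 residues described at two resolutions simultaneously
per_residue = fragment_ranges(1, 25, 1)    # one bead per residue
coarse = fragment_ranges(1, 25, 10)        # beads of up to 10 residues

print(len(per_residue), "beads at resolution 1")
print(coarse)  # [(1, 10), (11, 20), (21, 25)]
```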

### Dissecting the script <a name="Dissecting_the_script_3"></a>

The Python script sets up the representation of the system and the restraints.

The first part of the script imports the necessary libraries.

In [None]:
from __future__ import print_function

import IMP
import IMP.core
import IMP.pmi.restraints.crosslinking
import IMP.pmi.restraints.stereochemistry
import IMP.pmi.tools

import IMP.pmi.macros
import IMP.pmi.topology

# Hot fixes correcting minor bugs in IMP 2.17.0
import tutorial_util

import os
import sys

import warnings
warnings.filterwarnings('ignore')

There are two options. Either we use only the crosslinks, or we use the crosslinks with the cryoEM data. This choice is encoded in the following variable, which also controls whether the EM-related libraries are imported.

In [None]:
cryoEM=False

if cryoEM:
    step=1
    import IMP.bayesianem
    import IMP.bayesianem.restraint

If you are running several parallel jobs in a replica exchange sampling scheme (see Stage 3 below) you can use the MPI library. To get the index of the current replica you need the following code.

In [None]:
try:
    import IMP.mpi
    print('ReplicaExchange: MPI was found. Using Parallel Replica Exchange')
    rex_obj = IMP.mpi.ReplicaExchange()
except ImportError:
    # IMP.mpi is not available: fall back to the serial implementation
    import IMP.pmi.samplers
    print('ReplicaExchange: Could not find MPI. Using Serial Replica Exchange')
    rex_obj = IMP.pmi.samplers._SerialReplicaExchange()

# Index of the replica handled by this process
replica_number = rex_obj.get_my_index()

Then set up the relevant paths of the input files. 

In [None]:
datadirectory = "../data/"
output_directory = "./output"

if not cryoEM:
    topology_file = datadirectory+"topology_poliii.txt" 
else:
    topology_file = datadirectory+"topology_poliii.cryoem.txt" 

#### Model Representation Using a Topology File <a name="Topology_file_4"></a>

This part of the script defines the topology of the system, including the hierarchy, the representation and the degrees of freedom. This is the content of the file `../data/topology_poliii.txt`, which is in a table format:

In [None]:
'''
|molecule_name|color      |fasta_fn       |fasta_id|pdb_fn                           |chain|residue_range|pdb_offset|bead_size|em_residues_per_gaussian|rigid_body|super_rigid_body|chain_of_super_rigid_bodies|
|ABC23        |0.0,1.0,1.0|sequences.fasta|P20435  |Pol3_core_Model_on4c3i.pdb       |F    |1,END        |0         |10       |10                      |1         |1               |1                          |
|ABC10beta    |0.5,1.0,0.5|sequences.fasta|P22139  |Pol3_core_Model_on4c3i.pdb       |J    |1,END        |0         |10       |10                      |1         |1               |1                          |
|ABC14_5      |0.0,1.0,0.0|sequences.fasta|P20436  |Pol3_core_Model_on4c3i.pdb       |H    |1,END        |0         |10       |10                      |1         |1               |1                          |
|ABC27        |0.8,0.1,0.6|sequences.fasta|P20434  |Pol3_core_Model_on4c3i.pdb       |E    |1,END        |0         |10       |10                      |1         |1               |1                          |
|C25          |0.0,0.5,1.0|sequences.fasta|P35718  |Pol3_core_Model_on4c3i.pdb       |G    |1,END        |0         |10       |10                      |1         |1               |1                          |
|AC40         |1.0,0.0,0.0|sequences.fasta|P07703  |Pol3_core_Model_on4c3i.pdb       |C    |1,END        |0         |10       |10                      |1         |1               |1                          |
|C160         |0.7,0.7,0.7|sequences.fasta|P04051  |Pol3_core_Model_on4c3i.pdb       |A    |1,END        |0         |10       |10                      |1         |1               |1                          |
|ABC10alpha   |0.0,0.0,1.0|sequences.fasta|P40422  |Pol3_core_Model_on4c3i.pdb       |L    |1,END        |0         |10       |10                      |1         |1               |1                          |
|C128         |1.0,0.8,0.7|sequences.fasta|P22276  |Pol3_core_Model_on4c3i.pdb       |B    |1,END        |0         |10       |10                      |1         |1               |1                          |
|AC19         |1.0,0.6,0.0|sequences.fasta|P28000  |Pol3_core_Model_on4c3i.pdb       |K    |1,END        |0         |10       |10                      |1         |1               |1                          |
|C11          |1.0,1.0,0.0|sequences.fasta|Q04307  |Pol3_core_Model_on4c3i.pdb       |I    |1,END        |0         |10       |10                      |1         |1               |1                          |
|C17          |1.0,0.0,0.5|sequences.fasta|P47076  |Pol3_core_Model_on4c3i.pdb       |D    |1,END        |0         |10       |10                      |1         |1               |1                          |
|C31          |1.0,1.0,0.5|sequences.fasta|P17890  |BEADS                            |Q    |1,END        |0         |10       |10                      |2         |2               |2                          |
|C34          |0.9,0.6,0.1|sequences.fasta|P32910  |C34_wHTH1_Model_on2dk8_A.pdb     |P    |1,76         |0         |10       |10                      |3         |3               |3                          |
|C34          |0.9,0.6,0.1|sequences.fasta|P32910  |BEADS                            |P    |77,86        |0         |10       |10                      |4         |3               |3                          |
|C34          |0.9,0.6,0.1|sequences.fasta|P32910  |C34_wHTH2_Model_on2dk5_A.pdb     |P    |87,152       |0         |10       |10                      |5         |3               |3                          |
|C34          |0.9,0.6,0.1|sequences.fasta|P32910  |BEADS                            |P    |153,179      |0         |10       |10                      |6         |3               |3                          |
|C34          |0.9,0.6,0.1|sequences.fasta|P32910  |C34_wHTH3_Model_on1ldd_A.pdb     |P    |180,END      |0         |10       |10                      |7         |3               |3                          |
|C53          |0.4,0.8,1.0|sequences.fasta|P25441  |C37_C53_dimer.Model_on4c3i_MN.pdb|N    |1,END        |0         |10       |10                      |8         |8               |8                          |
|C37          |0.6,0.1,0.6|sequences.fasta|P36121  |C37_C53_dimer.Model_on4c3i_MN.pdb|M    |1,181        |0         |10       |10                      |8         |8               |8                          |
|C37          |0.6,0.1,0.6|sequences.fasta|P36121  |BEADS                            |     |182,END      |0         |10       |10                      |8         |8               |8                          |
|C82          |1.0,0.3,0.0|sequences.fasta|P32349  |C82_Model_on2xub_A.pdb           |O    |1,END        |0         |10       |10                      |9         |9               |9                          |
'''

Using the table above we define the overall topology: we introduce the molecules with their sequence and their known structure, and define the movers. Each line is a user-defined molecular **Domain**, and each column contains the specifics needed to build the system:

* `molecule_name`: Name of the Molecule and the name of the Hierarchy that contains the corresponding Domain.
* `color`: The color used in the output coordinates file. Uses [Chimera names](https://www.cgl.ucsf.edu/chimera/docs/UsersGuide/colortables.html) (e.g. "red"), or R,G,B values as three comma-separated floating point numbers from 0 to 1 (e.g. "1.0, 0.0, 0.0") or a 6-digit hex string starting with '#' (e.g. "#ff0000").
* `fasta_fn`: Name of the FASTA file containing the sequence for this Molecule.
* `fasta_id`: header line of FASTA (without the ">" character).
* `pdb_fn`: Name of PDB file with coordinates (if available). If left empty, will set up as BEADS (you can also explicitly specify "BEADS"). Can also write "IDEAL_HELIX".
* `chain`: Chain ID of this Domain in the PDB file.
* `residue_range`: Comma delimited pair defining the indexes of the first and the last residue of the Domain. Can leave empty or use 'all' for entire sequence from PDB file. The second item in the pair can be 'END' to select the last residue in the sequence defined in the FASTA file.
* `pdb_offset`: PMI always numbers residues starting from 1 (to match the FASTA file). If the PDB does not match this numbering, an offset can be specified here. It is added to the PDB residue number to get the FASTA number. For example, if the first residue in the PDB file is numbered 10, use an offset of -9.
* `bead_size`: The size (in residues) of beads used to model Fragments not covered by PDB coordinates.
* `em_residues_per_gaussian`: The number of residues represented by each Gaussian used to model the density of this Domain. Can be set to zero/empty to exclude Domains from the EM restraint. The GMM files will be written to `gmm_dir`.
* `rigid_body`: Leave empty to treat this Domain flexibly. Otherwise, use a unique Rigid Body identifier (an integer). All Domains with the same Rigid Body identifier will be collected in the same Rigid Body.
* `super_rigid_body`: Like the Rigid Body, the user can specify a unique Super Rigid Body identifier.
* `chain_of_super_rigid_bodies`: Like the Rigid Body, the user can specify a unique Chain of Super Rigid Body identifier.
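
To make the column layout and the `pdb_offset` convention concrete, the sketch below splits one row of the table into named fields and converts a PDB residue number into the FASTA numbering. It is for illustration only: in the tutorial the file is parsed by `IMP.pmi.topology.TopologyReader`, and the helper names `parse_row` and `fasta_number` are hypothetical.

```python
COLUMNS = ["molecule_name", "color", "fasta_fn", "fasta_id", "pdb_fn", "chain",
           "residue_range", "pdb_offset", "bead_size", "em_residues_per_gaussian",
           "rigid_body", "super_rigid_body", "chain_of_super_rigid_bodies"]


def parse_row(line):
    """Split one '|'-delimited topology row into a {column: value} dict."""
    # The row starts and ends with '|', so drop the empty first and last pieces
    fields = [field.strip() for field in line.strip().split("|")[1:-1]]
    if len(fields) != len(COLUMNS):
        raise ValueError("expected %d columns, found %d" % (len(COLUMNS), len(fields)))
    return dict(zip(COLUMNS, fields))


def fasta_number(pdb_number, pdb_offset):
    """The offset is added to the PDB residue number to get the FASTA number."""
    return pdb_number + pdb_offset


# One row copied from the table above: the first flexible linker of C34
row = parse_row("|C34          |0.9,0.6,0.1|sequences.fasta|P32910  |BEADS"
                "                            |P    |77,86        |0         |10"
                "       |10                      |4         |3               |3"
                "                          |")
print(row["molecule_name"], row["pdb_fn"], row["residue_range"], row["rigid_body"])

# A PDB file whose first residue is numbered 10 needs an offset of -9
print(fasta_number(10, -9))  # 1
```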

The first section defines where input files are located.  The topology file defines how the system components are structurally represented. `target_gmm_file` stores the EM map for the entire complex, which has already been converted into a Gaussian mixture model.

In [None]:
# Initialize IMP model
m = IMP.Model()

# Read in the topology file.  
# Specify the directory where the PDB files, FASTA files and GMM files are
topology = IMP.pmi.topology.TopologyReader(topology_file, 
                                  pdb_dir=datadirectory, 
                                  fasta_dir=datadirectory, 
                                  gmm_dir=datadirectory)

In [None]:
# Use the BuildSystem macro to build states from the topology file
bs = IMP.pmi.macros.BuildSystem(m)

In [None]:
# Each state can be specified by a topology file.
bs.add_state(topology)

#### Building the System Representation and Degrees of Freedom <a name="Representation_and_DOF_4"></a>

Here we can set the **Degrees of Freedom** parameters, which should be
optimized according to MC acceptance ratios. There are three kinds of movers: Rigid Body, Bead, and Super Rigid Body. 

`max_rb_trans` and `max_rb_rot` are the 
maximum translation and rotation of the Rigid Body mover, `max_srb_trans` and 
`max_srb_rot` are the maximum translation and rotation of the Super Rigid Body mover
and `max_bead_trans` is the maximum translation of the Bead Mover.
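
The link between a mover's maximum step and the Monte Carlo acceptance ratio can be seen in a toy example. The sketch below runs Metropolis sampling of a single coordinate in a harmonic score and reports the fraction of accepted moves for several maximum translations. Everything in it (the score, the temperature, the function name `acceptance_ratio`) is an assumption made for illustration and is unrelated to the actual PMI movers.

```python
import math
import random


def acceptance_ratio(max_trans, n_steps=5000, temperature=1.0, seed=0):
    """Fraction of accepted Metropolis moves for a 1D harmonic toy score."""
    rng = random.Random(seed)
    x = 0.0
    score = 0.5 * x * x
    accepted = 0
    for _ in range(n_steps):
        # Propose a random translation of at most max_trans
        trial = x + rng.uniform(-max_trans, max_trans)
        trial_score = 0.5 * trial * trial
        delta = trial_score - score
        # Metropolis criterion: always accept downhill, sometimes accept uphill
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            x, score = trial, trial_score
            accepted += 1
    return accepted / n_steps


for max_trans in (0.1, 1.0, 10.0):
    print(max_trans, acceptance_ratio(max_trans))
```

Small steps are almost always accepted but explore slowly, while large steps are mostly rejected; tuning the maximum translations and rotations is a compromise between the two.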

The execution of the macro will create the representation (as a molecular hierarchy) by reading the topology file, then return the root of this hierarchy (`root_hier`) and the degrees of freedom (`dof`) objects, both of which are used later on.


In [None]:
root_hier, dof = bs.execute_macro(max_rb_trans=4.0, 
                                  max_rb_rot=0.3, 
                                  max_bead_trans=4.0, 
                                  max_srb_trans=4.0,
                                  max_srb_rot=0.3)

At this point we have created the complete representation of the system. The representation should look like the following in Chimera.

<figure><img src="images/rnapoliii_representation_total.png" width="300px" />
<figcaption>_Domain Representation of the whole complex in its final modelled conformation_</figcaption></figure>

---

<figure><img src="images/rnapoliii_representation_core.png" width="300px" />
<figcaption>_Domain Representation of the core_</figcaption></figure>

---

<figure><img src="images/rnapoliii_representation_c31.png" width="300px" />
<figcaption>_Domain Representation of C31_</figcaption></figure>

---

<figure><img src="images/rnapoliii_representation_c34.png" width="300px" />
<figcaption>_Domain Representation of C34_</figcaption></figure>

---

<figure><img src="images/rnapoliii_representation_c53_c37.png" width="300px" />
<figcaption>_Domain Representation of C53/C37 heterodimer_</figcaption></figure>

---

<figure><img src="images/rnapoliii_representation_c82.png" width="300px" />
<figcaption>_Domain Representation of C82_</figcaption></figure>


We can display the representation of the system along the sequence. Each color corresponds to a domain of the complex assigned to an individual rigid body. White spaces are the beads.

In [None]:
%matplotlib inline

import IMP.pmi.plotting
import IMP.pmi.plotting.topology

IMP.pmi.plotting.topology.draw_component_composition(dof)

We randomize the initial configuration to remove any bias from the starting configuration read from the input files. Since each subunit is composed of rigid bodies (i.e., beads constrained in a structure) and flexible beads, the configuration of the system is initialized by displacing each mobile rigid body and each bead randomly by up to 50 Angstroms, rotating them randomly, and moving them far enough from each other to prevent any steric clashes. 

The system will look like this after the initial randomization:

<figure><img src="images/rnapoliii_initial.png" width="600px" />
<figcaption>_Initial random configuration_</figcaption></figure>
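
The idea behind the shuffling can be sketched with plain points instead of rigid bodies. This is a toy stand-in, not the implementation of `shuffle_configuration`: the function name `shuffle_positions` is hypothetical, rotations are omitted, and the parameter names are borrowed from the call below only for readability.

```python
import math
import random

def shuffle_positions(centers, max_translation=50.0, cutoff=5.0,
                      niterations=100, seed=0):
    """Randomly displace each center, retrying until it is clash-free.

    A trial position is accepted when it lies at least `cutoff` away from
    every position already placed. Rotations are omitted in this toy.
    """
    rng = random.Random(seed)
    placed = []
    for center in centers:
        for _ in range(niterations):
            # Uniform displacement of up to max_translation along each axis
            trial = tuple(c + rng.uniform(-max_translation, max_translation)
                          for c in center)
            if all(math.dist(trial, other) >= cutoff for other in placed):
                placed.append(trial)
                break
        else:
            raise RuntimeError("no clash-free position found")
    return placed

start = [(0.0, 0.0, 0.0)] * 5     # five bodies stacked on the same point
for pos in shuffle_positions(start):
    print(tuple(round(c, 1) for c in pos))
```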

In [None]:
# Shuffle the rigid body and beads configuration for all molecules

# if you use XL only 
if not cryoEM:
    IMP.pmi.tools.shuffle_configuration(root_hier,
                                        max_translation=50, 
                                        verbose=False,
                                        cutoff=5.0,
                                        niterations=100)

# otherwise you randomize only if you start a new cryoEM-XL modeling
else:
    if step==1:
        # Shuffle the rigid bodies and all flexible beads. No rigid body is
        # explicitly excluded, since excluded_rigid_bodies is commented out below.
        IMP.pmi.tools.shuffle_configuration(root_hier,
                                            max_translation=300,
                                            verbose=True,
                                            cutoff=5.0,
                                            niterations=100)
                                            #excluded_rigid_bodies=fixed_rbs,


    else:
        rh_ref = RMF.open_rmf_file_read_only('seed_%d.rmf3'%(step-1))
        IMP.rmf.link_hierarchies(rh_ref, [root_hier])
        IMP.rmf.load_frame(rh_ref, RMF.FrameID(replica_number))

### Scoring Function <a name="Scoring_Function_3"></a>

After defining the representation of the model, we build the **restraints** by which the individual structural models will be scored based on the input data.

The sum of all of these restraints is our **scoring function**. 
For all restraints, calling `add_to_model()` incorporates them into the scoring function.
Appending the restraints to the `outputobjects` list reports them in the log files produced in the sampling.
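
As a rough illustration of a scoring function built as a sum of restraints, the sketch below uses toy classes. `ToyModel` and `ToyDistanceRestraint` are hypothetical names and not part of IMP; only the pattern of registering a restraint through `add_to_model()` mirrors the text above.

```python
import math

class ToyModel:
    """Holds the restraints that make up the scoring function."""
    def __init__(self):
        self.restraints = []

    def score(self, coords):
        # The total score is the weighted sum of all registered restraints
        return sum(r.weight * r.evaluate(coords) for r in self.restraints)

class ToyDistanceRestraint:
    """Harmonic penalty on the distance between particles i and j."""
    def __init__(self, model, i, j, target, weight=1.0):
        self.model, self.i, self.j = model, i, j
        self.target, self.weight = target, weight

    def evaluate(self, coords):
        d = math.dist(coords[self.i], coords[self.j])
        return (d - self.target) ** 2

    def add_to_model(self):
        # Only restraints registered here contribute to the score
        self.model.restraints.append(self)

model = ToyModel()
coords = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (0.0, 0.0, 10.0)]
r1 = ToyDistanceRestraint(model, 0, 1, target=5.0)
r2 = ToyDistanceRestraint(model, 0, 2, target=8.0, weight=2.0)
r1.add_to_model()
print(model.score(coords))   # r2 is not registered yet
r2.add_to_model()
print(model.score(coords))   # now both restraints contribute
```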

In [None]:
outputobjects = [] # reporter objects...output is included in the stat file

#### Connectivity Restraint <a name="Connectivity_Restraint_4"></a>

In [None]:
# Connectivity keeps things connected along the backbone (ignores if inside same rigid body)
mols = IMP.pmi.tools.get_molecules(root_hier)
for mol in mols:
    molname=mol.get_name()        
    IMP.pmi.tools.display_bonds(mol)
    cr = IMP.pmi.restraints.stereochemistry.ConnectivityRestraint(mol,scale=2.0)
    cr.add_to_model()
    cr.set_label(molname)
    outputobjects.append(cr)

#### Excluded Volume Restraint <a name="Excluded_Volume_Restraint_4"></a>

This restraint simply keeps subunits from occupying the same space. We can safely apply this to the low-resolution representation of the system, for speed.
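
A minimal sketch of what such a term penalizes, assuming each bead is a sphere and the penalty is the squared overlap between spheres. This is a simplification written for illustration; the function `excluded_volume_score` is hypothetical and is not how `ExcludedVolumeSphere` is implemented.

```python
import math
from itertools import combinations

def excluded_volume_score(spheres):
    """Sum of squared overlaps over all pairs of (x, y, z, radius) spheres."""
    total = 0.0
    for (x1, y1, z1, r1), (x2, y2, z2, r2) in combinations(spheres, 2):
        d = math.dist((x1, y1, z1), (x2, y2, z2))
        overlap = (r1 + r2) - d
        if overlap > 0:          # only interpenetrating spheres are penalized
            total += overlap ** 2
    return total

overlapping = [(0.0, 0.0, 0.0, 2.0), (3.0, 0.0, 0.0, 2.0)]
separated = [(0.0, 0.0, 0.0, 2.0), (10.0, 0.0, 0.0, 2.0)]
print(excluded_volume_score(overlapping), excluded_volume_score(separated))
```

In this naive all-pairs version the number of pairs grows quadratically with the number of beads, which is why a coarser representation is much cheaper to evaluate.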

In [None]:
ev = IMP.pmi.restraints.stereochemistry.ExcludedVolumeSphere(
                                         included_objects=root_hier,
                                         resolution=10)
ev.add_to_model()         # add to scoring function
outputobjects.append(ev)  # add to output

#### Crosslinks - dataset 1 <a name="Crosslink_1_4"></a>

A crosslinking restraint is implemented as a distance restraint between two residues.  The two residues are each defined by the protein (component) name and the residue number.  The script here extracts the correct four columns that provide this information from the input data file, plus additional data if available such as a confidence score.

To use this restraint we have to first define the data format.  
 
In this case the data file, `data/FerberKosinski2016_apo.csv`, is in a simple comma-separated values (CSV) format, an excerpt of which is shown below:

```
Protein1,Protein2,AbsPos1,AbsPos2,ld-Score
C128,C53,570,370,50.4
C82,C34,313,204,47.18
...
```
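
To make the column mapping concrete, the excerpt above can be read with the Python standard library alone. This is only an illustration of the file layout, independent of the PMI classes used in the next cell; the helper `read_crosslinks` is a hypothetical name.

```python
import csv
import io

excerpt = """Protein1,Protein2,AbsPos1,AbsPos2,ld-Score
C128,C53,570,370,50.4
C82,C34,313,204,47.18
"""

def read_crosslinks(handle):
    """Return (protein1, residue1, protein2, residue2, score) tuples."""
    rows = []
    for row in csv.DictReader(handle):
        rows.append((row["Protein1"], int(row["AbsPos1"]),
                     row["Protein2"], int(row["AbsPos2"]),
                     float(row["ld-Score"])))
    return rows

for xl in read_crosslinks(io.StringIO(excerpt)):
    print(xl)
```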

In [None]:
# We then initialize a CrossLinkDataBase that uses a keywords converter to map column to information.
# The required fields are the protein and residue number for each side of the crosslink.
xldbkwc = IMP.pmi.io.crosslink.CrossLinkDataBaseKeywordsConverter()
xldbkwc.set_protein1_key("Protein1")
xldbkwc.set_protein2_key("Protein2")
xldbkwc.set_residue1_key("AbsPos1")
xldbkwc.set_residue2_key("AbsPos2")
xldbkwc.set_id_score_key("ld-Score")

xl1 = IMP.pmi.io.crosslink.CrossLinkDataBase(xldbkwc)
xl1.create_set_from_file(datadirectory+'FerberKosinski2016_apo.csv')
xl1.set_name("APO")

xl2 = IMP.pmi.io.crosslink.CrossLinkDataBase(xldbkwc)
xl2.create_set_from_file(datadirectory+'FerberKosinski2016_DNA.csv')
xl2.set_name("DNA")

# Append the xl2 dataset to the xl1 dataset to create a larger dataset
xl1.append_database(xl2)

# Rename one protein name
xl1.rename_proteins({"ABC14.5":"ABC14_5"})

# Create 3 confidence classes
xl1.classify_crosslinks_by_score(3)

# Now, we set up the restraint.
xl1rest = IMP.pmi.restraints.crosslinking.CrossLinkingMassSpectrometryRestraint(
                                   root_hier=root_hier,  # The root hierarchy
                                   database=xl1,# The XLDB defined above
                                   length=21.0,          # Length of the linker in angstroms
                                   slope=0.002,          # A linear term that biases XLed
                                                         # residues together
                                   resolution=1.0,       # Resolution at which to apply the restraint. 
                                                         # Either 1 (residue) or 0 (atomic)
                                   label="XL",           # Used to label output in the stat file
                                   weight=10.)           # Weight applied to all crosslinks 
                                                         # in this dataset
xl1rest.add_to_model()
outputobjects.append(xl1rest)


> #### Electron Microscopy Restraint <a name="EM_4"></a>
>
> The [GaussianEMRestraint](https://integrativemodeling.org/nightly/doc/ref/classIMP_1_1isd_1_1GaussianEMRestraint.html) uses a density overlap function to compare model to data.
> First the EM map is approximated with a Gaussian Mixture Model (GMM, done separately).
> Second, the components of the model are represented with Gaussians (forming the model GMM).
>
> The restraint takes, among others, the following parameters:
>
> * `slope`: nudge model closer to map when far away
> * `weight`: heuristic, needed to calibrate the EM restraint with the other terms. 
>
> We then add the restraint to the Model and to the output objects.
> The restraint is fully described in this paper [Bonomi et al. 2018](https://salilab.org/pdf/Bonomi_Structure_2018.pdf).
>
> The GMMs are computed using an adapted version of [gmconvert](https://pdbj.org/gmfit/doc_gmconvert/README_gmconvert.html) which exploits a divide and conquer approach [recursive-gmconvert](https://gitlab.pasteur.fr/rpellari/recursive-gmconvert).
>
> The GMMs are computed using an increasing number of Gaussians. The optimal number of Gaussians is chosen as the one that reproduces the same resolution as the original map.
>
> <figure><img src="images/resolution_analysis.png" width="600px" />
> <figcaption>_Searching for the optimal number of Gaussians. Left. the Fourier Shell Correlation is computed between the experimental map and the GMM (each curve has a different number of components). Right. Resolution against the number of components_</figcaption></figure>
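
To give a feeling for how two GMMs can be compared, the sketch below computes the analytic overlap integral of normalized isotropic 3D Gaussians and sums it over all component pairs. It is a simplified stand-in and not the score used by the restraint; the isotropic assumption, the `(weight, center, sigma)` layout and both function names are chosen here for illustration.

```python
import math

def gaussian_overlap(mu1, sigma1, mu2, sigma2):
    """Overlap integral of two normalized isotropic 3D Gaussians."""
    var = sigma1 ** 2 + sigma2 ** 2
    d2 = sum((a - b) ** 2 for a, b in zip(mu1, mu2))
    return (2.0 * math.pi * var) ** -1.5 * math.exp(-d2 / (2.0 * var))

def gmm_overlap(gmm_a, gmm_b):
    """Weighted sum of overlaps between all component pairs.

    Each GMM is a list of (weight, center, sigma) tuples.
    """
    return sum(wa * wb * gaussian_overlap(ca, sa, cb, sb)
               for wa, ca, sa in gmm_a
               for wb, cb, sb in gmm_b)

map_gmm = [(0.5, (0.0, 0.0, 0.0), 5.0), (0.5, (20.0, 0.0, 0.0), 5.0)]
fitted = [(1.0, (10.0, 0.0, 0.0), 8.0)]     # model sitting inside the map
shifted = [(1.0, (60.0, 0.0, 0.0), 8.0)]    # model far away from the map
print(gmm_overlap(map_gmm, fitted), gmm_overlap(map_gmm, shifted))
```

The overlap drops quickly as the model moves away from the map, which is the situation the `slope` term is meant to help with.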

In [None]:
if cryoEM:
    # GMM approximation of the EM map for the current refinement step
    target_gmm_file=datadirectory+'%d_imp.gmm'%(step)
    # Get the model density objects that will be fitted to the EM density.
    densities = IMP.atom.Selection(root_hier, representation_type=IMP.atom.DENSITIES).get_selected_particles()
    gem = IMP.bayesianem.restraint.GaussianEMRestraintWrapper(densities,
                                                 target_fn=target_gmm_file,
                                                 scale_target_to_mass=True,
                                                 slope=0.01,
                                                 target_radii_scale=3.0,
                                                 target_is_rigid_body=False)

    gem.add_to_model()
    gem.set_label("Total")
    outputobjects.append(gem)

## Stage 3 - Sampling <a name="Sampling_2"></a>

With the system representation built and data restraints entered, the system is now ready to sample configurations. A replica exchange run can be set up using the [ReplicaExchange0](https://integrativemodeling.org/nightly/doc/ref/classIMP_1_1pmi_1_1macros_1_1ReplicaExchange0.html) macro. (Follow that link for a full description of all of the input parameters.)

Replica exchange greatly improves the sampling (see [wikipedia](https://en.wikipedia.org/wiki/Parallel_tempering) for a superficial description).
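
The core of replica exchange is the rule for swapping configurations between two replicas at different temperatures. A minimal sketch of the standard Metropolis exchange criterion follows; the function names and the dictionary layout are assumptions for illustration and do not reflect the macro's internals.

```python
import math
import random

def swap_probability(score_i, temp_i, score_j, temp_j):
    """Metropolis acceptance probability for exchanging two replicas."""
    exponent = (1.0 / temp_i - 1.0 / temp_j) * (score_i - score_j)
    return 1.0 if exponent >= 0 else math.exp(exponent)

def attempt_swap(replicas, i, j, rng):
    """Exchange the configurations of replicas i and j if accepted.

    Each replica is a dict with "temperature", "score" and "config" keys.
    """
    a, b = replicas[i], replicas[j]
    p = swap_probability(a["score"], a["temperature"],
                         b["score"], b["temperature"])
    if rng.random() < p:
        a["score"], b["score"] = b["score"], a["score"]
        a["config"], b["config"] = b["config"], a["config"]
        return True
    return False

replicas = [
    {"temperature": 1.0, "score": 120.0, "config": "A"},
    {"temperature": 2.5, "score": 100.0, "config": "B"},
]
# The hot replica holds the better score, so the swap is always accepted
print(swap_probability(120.0, 1.0, 100.0, 2.5))
print(attempt_swap(replicas, 0, 1, random.Random(0)), replicas[0]["config"])
```

Swaps that move a better-scoring configuration to the colder replica are always accepted; the reverse is accepted only with a probability that decays exponentially.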

<figure><img src="images/rnapoliii_replica_exchange.png" width="600px" /></figure>

The sampling is performed by executing the macro:

```mc1.execute_macro()```


In [None]:
# total number of saved frames
num_frames = 5

# This object defines all components to be sampled as well as the sampling protocol
mc1=IMP.pmi.macros.ReplicaExchange0(m,
              root_hier=root_hier,                         # The root hierarchy
              monte_carlo_sample_objects=dof.get_movers()+xl1rest.get_movers(), # All moving particles and parameters
              output_objects=outputobjects,                # Objects to put into the stat file
              rmf_output_objects=outputobjects,            # Objects to put into the rmf file
              monte_carlo_temperature=1.0,   
              replica_exchange_minimum_temperature=1.0,
              replica_exchange_maximum_temperature=2.5,              
              simulated_annealing=False,
              number_of_best_scoring_models=0,
              monte_carlo_steps=10,
              number_of_frames=num_frames,
              save_coordinates_mode="25th_score",
              global_output_directory=output_directory)

# Start Sampling
mc1.execute_macro()

> The cryoEM modeling is slightly different. It uses an iterative approach where the model is refined using GMMs with an increasing number of components. At the end of each calculation a seed is generated (see below), and used to start a new modeling run.
>
> <figure><img src="images/rnapoliii_sampling_cryoem.png" width="600px" />
> <figcaption>_Iterative refinement based on seeding/increasing the components in the GMM_</figcaption></figure>

### Modeling Output <a name="Output_3"></a>

The script generates an output directory containing the following:

* `pdbs`: a directory for the best-scoring models from the run, in PDB format. The number of models kept is set by `number_of_best_scoring_models`; it is 0 in the call above, so no PDB models are saved in this run.
* `rmfs`: a single RMF file containing all the frames. RMF is a file format specially designed to store coarse-grained, multi-resolution and multi-state models such as those generated by IMP. It is a compact binary format and (as in this case) can also be used to store multiple models or trajectories. It stores the hierarchy and the coordinates of the particles, as well as information on each restraint, MC acceptance criteria and other things at each step.
* Statistics from the sampling, contained in a "statfile", `stat.*.out`. 

### Using `StatHierarchyHandler` for inline analysis <a name="ProcessOutput_3"></a>

We can use the class StatHierarchyHandler to analyse and plot the content of the RMF files.
This class coordinates the structures that have been generated 
and all the features that have been saved during the modeling run. It is a Hierarchy object, but it works like a list.
The Python script can be found in `modeling/short_analysis.py`.

In [None]:
import IMP.pmi.output

hh=IMP.pmi.output.StatHierarchyHandler(m,"./output/rmfs/0.rmf3")

#Total number of frames
print("Frames",len(hh))

# Describe the content of the first frame of the rmf file
print(hh[0])

# List all the feature names
for k in hh[0].features.keys(): print(k)
    


We can use the class [IMP.atom.Selection](https://integrativemodeling.org/nightly/doc/ref/classIMP_1_1atom_1_1Selection.html) to analyse the structures generated. 

In [None]:
# For instance we can compute the distance between two residues

%pylab inline

p0=IMP.atom.Selection(hh,molecule="C31",residue_index=10).get_selected_particles()[0]
p1=IMP.atom.Selection(hh,molecule="C34",residue_index=10).get_selected_particles()[0]

d0=IMP.core.XYZ(p0)
d1=IMP.core.XYZ(p1)

#note that hh can be used as a list
plot([IMP.core.get_distance(d0,d1) for h in hh]);

figure()

# Or we can get the radius of gyration of the whole complex
ps=IMP.atom.Selection(hh).get_selected_particles()
plot([IMP.atom.get_radius_of_gyration(ps) for h in hh])


Next, we plot the time series of selected features stored in the RMF file.

In [None]:
# To reduce I/O, we can store the data structure internal to hh, 
# so that it is not read directly from the files
# and it is faster

data=hh.data

# Then we plot the scores
plot([x.score for x in data])

figure() 

# finally we plot the score of one crosslink (keys containing "Distance" hold the crosslink distances)
plot([float(x.features["CrossLinkingMassSpectrometryRestraint_Score_|XL|29.APO.1|C31|91|C160|1458|0|CLASS_0|"]) for x in data]);



Additionally, we can draw the scatter plot between ld-Scores and distances

In [None]:
scores={}

for xl in xl1: 
    if not xl['IntraRigidBody']:
        scores.update({xl['XLUniqueSubID']:float(xl['IDScore'])})
    

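# Take the crosslink distances from one saved frame (here the third, index 2)
frame=data[2]
x=[]
y=[]
for k in frame.features.keys():
    if "Distance" in k:
        xl_id=k.split("|")[2]   # unique crosslink ID embedded in the feature name
        if xl_id in scores:
            x.append(scores[xl_id])
            y.append(float(frame.features[k]))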

scatter(x,y);

## Stage 4 - Analysis <a name="Analysis_3"></a>

In the analysis stage we cluster (group by similarity) the sampled models to determine high-probability configurations. Comparing clusters may indicate that there are multiple acceptable configurations given the data. 

In this stage we perform several analyses.  Here, we will perform calculations for:

* **Clustering**: Grouping the structures together using similarity via RMSD
* **Cluster Uncertainty**: Determining the within-group precision and between-group similarity via RMSD
* **Cluster Accuracy**: Fit of the calculated clusters to the true (known) solution
* **Sampling Exhaustiveness**: Qualitative and quantitative measurement of sampling completeness

### Precomputed results <a name="Precomputed_Results_3"></a>

A long modeling run was precomputed and analyzed. You can [download] it from our website, or [download](https://zenodo.org/record/3523241#.XbtB0y2ZMY2) it from the Zenodo repository. For your convenience, the 150 best scoring models from each run were extracted and put in the `results` directory under the following names:

* `150_xl.rmf`: the XL-only modeling run
* `150_xl_cryoem_1.rmf`: the first refinement of the XL+cryoEM modeling run
* `150_xl_cryoem_2.rmf`: the second refinement of the XL+cryoEM modeling run
* `150_xl_cryoem_3.rmf`: the third and last refinement of the XL+cryoEM modeling run

### Clustering top models <a name="Clustering_3"></a>
We use the [AnalysisReplicaExchange](https://integrativemodeling.org/nightly/doc/ref/classIMP_1_1pmi_1_1macros_1_1AnalysisReplicaExchange.html) class, which finds top-scoring models, extracts coordinates, runs clustering, and does basic cluster analysis including creating localization densities for each subunit. The script generates RMF and MRC files, which should be viewable in Chimera.

We can choose the number of clusters by changing the distance threshold, the subunits we want to use to calculate the RMSD, and the number of good-scoring solutions to include.

If we perform sampling multiple times separately, the runs can all be analyzed at the same time by appending their files to the input list. The `best_models` parameter sets the number of best-scoring models to be analyzed. Note that we use `alignment=False`: structural alignment is only needed when there is no absolute reference frame (such as an EM map).

In [None]:
are=IMP.pmi.macros.AnalysisReplicaExchange(m,
                 ["./results/150_xl_cryoem_3.rmf"],
                 best_models=150,
                 alignment=False)

print(are)

Then, we start the clustering. 
We specify the components used in calculating the RMSD between models. 
Then we cluster using an RMSD threshold of 10 Angstroms.
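
The role of the threshold can be illustrated with a small, self-contained sketch. The greedy scheme below is a stand-in chosen for brevity and is not claimed to be the algorithm used by `are.cluster()`; `rmsd` and `threshold_cluster` are hypothetical helpers, and the RMSD is computed without alignment.

```python
import math

def rmsd(model_a, model_b):
    """RMSD between two equally ordered coordinate lists, without alignment."""
    sq = sum(math.dist(p, q) ** 2 for p, q in zip(model_a, model_b))
    return math.sqrt(sq / len(model_a))

def threshold_cluster(models, threshold):
    """Assign each model to the first cluster whose representative is close.

    Models are assumed to be sorted from best to worst score, so each
    cluster is represented by its best-scoring member.
    """
    clusters = []                       # each cluster is a list of model indexes
    for idx, model in enumerate(models):
        for members in clusters:
            if rmsd(models[members[0]], model) <= threshold:
                members.append(idx)
                break
        else:
            clusters.append([idx])
    return clusters

base = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
near = [(x + 1.0, y, z) for x, y, z in base]     # shifted by 1 Angstrom
far = [(x + 50.0, y, z) for x, y, z in base]     # shifted by 50 Angstroms
print(threshold_cluster([base, near, far], threshold=10.0))
```

A smaller threshold splits the models into more, tighter clusters; a larger one merges them.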

In [None]:
are.set_rmsd_selection(molecules=["C31","C34","C53","C37","C82"])

are.cluster(10.0)

For each cluster, we can print its information:

In [None]:
# see the content of the "are" object
print(are)

#print the cluster info
for cluster in are:
    print(cluster)

We can get a given cluster by using square brackets, as with lists; for instance, `are[0]` is the cluster with index 0. We can iterate over the members of the cluster to display their information. Afterwards, we save the coordinates of the cluster in an RMF file.

In [None]:
for member in are[0]:
    print(member)
    
are.save_coordinates(are[0])

Next we can examine the distances between all cluster members. The plots are written to a single file in the clustering directory. The first plot is the distance matrix of the models after they have been grouped into clusters. 

The second plot is a dendrogram, basically showing the distance matrix in a hierarchical way. Each vertical line from the bottom is a model, and the horizontal lines show the RMSD agreement between models. Sometimes the dendrogram can indicate a natural number of clusters, which can help determine the correct threshold to use. 

<img src="images/rnapoliii_rmsd_plot.png" alt="Distance matrix and dendrogram" width="600px" />


In [None]:
# slow!
are.plot_rmsd_matrix("rmsd_matrix.pdf")

### Bayesian crosslink restraint

For the *Bayesian crosslink restraint*, we can plot the distances of the crosslinks in each of the three classes (low-, mid-, and high-confidence), as well as the corresponding values of the variables psi and sigma. 

In [None]:
keys=[k for k in are[0][0].features.keys() if "CrossLinkingMassSpectrometryRestraint" in k]
figure()
for ind in ["0","1","2"]:
    distances={}
    psi=[]
    sigma=[]
    for member in are[0]:
        for k in keys:
            if "CLASS_"+str(ind) in k and "Distance" in k:
                if k in distances:
                    distances[k].append(float(member.features[k]))
                else:
                    distances[k]=[float(member.features[k])]
            if "CLASS_"+str(ind) in k and "Psi" in k and "MonteCarlo" not in k:
                psi.append(float(member.features[k]))
            if "SIGMA" in k and "Sigma" in k and "MonteCarlo" not in k:
                sigma.append(float(member.features[k]))

    x=[distances[k] for k in distances]
    pylab.boxplot(x);
    figure()
    pylab.plot(range(len(psi)),psi);
    figure()
pylab.plot(range(len(sigma)),sigma);

> ### Create a seed<a name="seed_3"></a>
>
> As discussed above, a seed needs to be generated to use the iterative refinement protocol. A seed is a set of conformations, sampled during the modeling, which will be used to start each replica of the Replica Exchange sampling algorithm. In this case we need 48 structures in our seed.


In [None]:
### build the seed 

are.write_seed("seed.rmf3", 48)

### Structural uncertainty of the solutions <a name="uncertainty_3"></a>

The cluster center can be computed as the median structure. After that one can compute the precision of the cluster, as well as the average distance between two clusters.
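
These three quantities can be sketched on toy data as follows. The definitions used here are assumptions made for illustration and may differ in detail from the `are` methods: the center is the member with the smallest summed RMSD to the others, the precision is the average pairwise RMSD within a cluster, and the bipartite precision is the average RMSD between members of two clusters.

```python
import math
from itertools import combinations, product

def rmsd(model_a, model_b):
    """RMSD between two equally ordered coordinate lists, without alignment."""
    sq = sum(math.dist(p, q) ** 2 for p, q in zip(model_a, model_b))
    return math.sqrt(sq / len(model_a))

def cluster_center(cluster):
    """Index of the member with the smallest summed RMSD to all others."""
    return min(range(len(cluster)),
               key=lambda i: sum(rmsd(cluster[i], other) for other in cluster))

def precision(cluster):
    """Average RMSD over all pairs of distinct members (needs >= 2 members)."""
    pairs = list(combinations(cluster, 2))
    return sum(rmsd(a, b) for a, b in pairs) / len(pairs)

def bipartite_precision(cluster1, cluster2):
    """Average RMSD over all pairs with one member from each cluster."""
    pairs = list(product(cluster1, cluster2))
    return sum(rmsd(a, b) for a, b in pairs) / len(pairs)

# One-bead toy models placed along the x axis
cluster_a = [[(0.0, 0.0, 0.0)], [(2.0, 0.0, 0.0)], [(4.0, 0.0, 0.0)]]
cluster_b = [[(10.0, 0.0, 0.0)]]
print(cluster_center(cluster_a), precision(cluster_a),
      bipartite_precision(cluster_a, cluster_b))
```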

In [None]:
are.compute_cluster_center(cluster=are[0])
print(are.precision(cluster=are[0]))
print(are.bipartite_precision(cluster1=are[0],cluster2=are[1]))

We can plot the root mean square fluctuation (rmsf) of a given molecule in a given cluster.
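
For reference, a per-residue RMSF can be sketched in a few lines, using the usual definition: the square root of the mean squared deviation of each residue from its average position over the models. The dictionary layout (residue index mapped to coordinates) is an assumption for illustration, and the toy function `rmsf` is not the `are.rmsf` method.

```python
import math

def rmsf(models):
    """Per-residue RMSF from a list of {residue_index: (x, y, z)} models."""
    result = {}
    for res in models[0]:
        positions = [m[res] for m in models]
        # Average position of this residue over all models
        mean = tuple(sum(axis) / len(positions) for axis in zip(*positions))
        msf = sum(math.dist(p, mean) ** 2 for p in positions) / len(positions)
        result[res] = math.sqrt(msf)
    return result

models = [
    {1: (0.0, 0.0, 0.0), 2: (-1.0, 0.0, 0.0)},
    {1: (0.0, 0.0, 0.0), 2: (1.0, 0.0, 0.0)},
]
print(rmsf(models))
```

Residue 1 does not move between the two models, while residue 2 fluctuates around its mean position.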

In [None]:
rmsf1=are.rmsf(cluster=are[0],molecule='C31');
plot(list(rmsf1.keys()),list(rmsf1.values()),marker=".",linewidth=0)
figure()

rmsf2=are.rmsf(cluster=are[0],molecule='C34');
plot(list(rmsf2.keys()),list(rmsf2.values()),marker=".",linewidth=0)

We can also compute the RMSF for all molecules and map the values onto the structure. Finally, we save the colored coordinates in an RMF file.

<img src="images/rnapoliii_rmsf.png" alt="Structural uncertainty" width="600px" />

In [None]:
for mol in ['ABC23','ABC10beta','ABC14_5','ABC27','C25','AC40','C160','ABC10alpha','C128','AC19','C11','C17','C31','C34','C53','C37','C82']: 
    are.rmsf(cluster=are[0],molecule=mol);
ch1=IMP.pmi.tools.ColorHierarchy(are.stath1)
ch1.color_by_uncertainty()
are.save_coordinates(are[0])

We can save the localization densities of a given cluster, for given groups of molecules.
Now we specify the subunits (or groups or fractions of subunits) for which we want to create density localization maps. 
`density_names` is a dictionary, where the keys are convenient names like "C31" and the values are a list of selections. 

The localization densities can give a qualitative idea of the precision of a cluster. Below we show results from `Cluster-0` in the provided results. The localizations are quite narrow and close to the native solution:

<img src="images/rnapoliii_density.png" alt="Localization densities" width="600px" />
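
Conceptually, a localization density records how often a region of space is occupied across the models of a cluster. The sketch below uses simple voxel counting as a stand-in; the densities written by `save_densities` are not claimed to be computed this way, and `localization_density` with its `voxel_size` parameter is a hypothetical helper.

```python
import math
from collections import Counter

def localization_density(models, voxel_size=5.0):
    """Fraction of models that place at least one bead in each voxel."""
    counts = Counter()
    for coords in models:
        # A set, so that each model counts at most once per voxel
        voxels = {tuple(math.floor(c / voxel_size) for c in xyz)
                  for xyz in coords}
        counts.update(voxels)
    return {voxel: n / len(models) for voxel, n in counts.items()}

# Three one-bead models: two fall in the same voxel, one in another
models = [[(1.0, 1.0, 1.0)], [(2.0, 2.0, 2.0)], [(12.0, 1.0, 1.0)]]
density = localization_density(models)
print(density)
```

A narrow density, with most of the occupancy concentrated in a few voxels, corresponds to a precisely localized subunit.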

In [None]:
density_names={'core': ['ABC23','ABC10beta','ABC14_5','ABC27','C25','AC40','C160','ABC10alpha','C128','AC19','C11','C17'],
               'C53': ['C53'], 
               'C37': ['C37'], 
               'C34': ['C34'], 
               'C82': ['C82'], 
               'C31': ['C31']}

# you can iterate on the clusters
for n,a in enumerate(are):
    are.save_densities(cluster=a,density_custom_ranges=density_names,prefix="Cluster-"+str(n))

We can compute the global contact map of the whole complex for the cluster.

<img src="images/rnapoliii_contact_map.png" alt="Localization densities" width="600px" />
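
A contact map of this kind can be thought of as the fraction of models in which two parts of the complex are closer than some cutoff. The sketch below works on named beads; the 10 Angstrom cutoff, the data layout and the function `contact_frequency` are assumptions for illustration, not the parameters of `are.contact_map`.

```python
import math
from itertools import combinations

def contact_frequency(models, cutoff=10.0):
    """Fraction of models in which each pair of beads is within `cutoff`.

    Each model is a dict mapping a bead name to its (x, y, z) position.
    """
    names = sorted(models[0])
    freq = {}
    for a, b in combinations(names, 2):
        hits = sum(1 for m in models if math.dist(m[a], m[b]) <= cutoff)
        freq[(a, b)] = hits / len(models)
    return freq

models = [
    {"C31": (0.0, 0.0, 0.0), "C34": (5.0, 0.0, 0.0), "C82": (40.0, 0.0, 0.0)},
    {"C31": (0.0, 0.0, 0.0), "C34": (15.0, 0.0, 0.0), "C82": (40.0, 0.0, 0.0)},
]
print(contact_frequency(models))
```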

In [None]:
# it is slow
are.contact_map(cluster=are[0]);

### Ensemble analysis

Using the script in `rnapoliii/modeling/results/ensemble.py` you can repeat the analysis above for the four different ensembles of solutions: XL, XL+cryoEM with 50 Gaussians (step 1), XL+cryoEM with 200 Gaussians (step 2), and XL+cryoEM with 800 Gaussians (step 3). In the same directory you can find the files related to this analysis. Below you can compare the RMSF and localization results (volume threshold set to 0.3 on all images).

<figure><img src="images/rnapoliii_xl_ensemble.png" alt="Localization densities" width="400px" />
<figcaption>_RMSF of the XL ensemble_</figcaption></figure>

> <figure><img src="images/rnapoliii_xl_cryoem_1_ensemble.png" alt="Localization densities" width="350px" />
> <figcaption>_RMSF of the XL cryoEM step 1 ensemble_</figcaption></figure>

> <figure><img src="images/rnapoliii_xl_cryoem_2_ensemble.png" alt="Localization densities" width="400px" />
> <figcaption>_RMSF of the XL cryoEM step 2 ensemble_</figcaption></figure>

> <figure><img src="images/rnapoliii_xl_cryoem_3_ensemble.png" alt="Localization densities" width="400px" />
> <figcaption>_RMSF of the XL cryoEM step 3 ensemble_</figcaption></figure>

<figure><img src="images/rnapoliii_xl_ensemble_localization_0.3.png" alt="Localization densities" width="380px" />
<figcaption>_Localization of the XL ensemble_</figcaption></figure>

> <figure><img src="images/rnapoliii_xl_cryoem_1_ensemble_localization_0.3.png" alt="Localization densities" width="400px" />
> <figcaption>_Localization of the XL cryoEM step 1 ensemble_</figcaption></figure>

> <figure><img src="images/rnapoliii_xl_cryoem_2_ensemble_localization_0.3.png" alt="Localization densities" width="400px" />
> <figcaption>_Localization of the XL cryoEM step 2 ensemble_</figcaption></figure>

> <figure><img src="images/rnapoliii_xl_cryoem_3_ensemble_localization_0.3.png" alt="Localization densities" width="380px" />
> <figcaption>_Localization of the XL cryoEM step 3 ensemble_</figcaption></figure>

### Sampling Exhaustiveness <a name="Sampling_Exhaustiveness_3"></a>
We can also determine sampling exhaustiveness by dividing the models into multiple sets, performing clustering on each set separately, and comparing the clusters. This is covered in the separate analysis tutorial in `rnapoliii/analysis`.

### Model deposition

In `rnapoliii/deposition` we will describe the procedure used to deposit integrative modeling studies in the [PDB-Dev](https://pdb-dev.wwpdb.org/) database in mmCIF format.