%intersphinx https://python-ihm.readthedocs.io/en/latest
Deposition of integrative models {#mainpage}
================================

[TOC]

In this tutorial we will introduce the procedure used to deposit integrative modeling studies in the [PDB-Dev](https://pdb-dev.wwpdb.org/) database in mmCIF format.

We will demonstrate the procedure using [IMP](https://integrativemodeling.org/) and its PMI module, but the database will accept integrative models from any software, as long as they are compliant mmCIF files (e.g. there are several HADDOCK and Rosetta models already in PDB-Dev). 

# Why PDB-Dev? {#whypdbdev}

PDB-Dev is a database run by the wwPDB. It is specifically for the deposition of *integrative* models, i.e. models generated using more than one source of experimental data. (Models that use only one experiment generally go in PDB; models that use no experimental information - theoretical models - go in [ModelArchive](https://www.modelarchive.org/).)

# Why mmCIF? {#whymmcif}

wwPDB already uses the mmCIF file format for X-ray structures, and has a formal data model ([PDBx](http://mmcif.wwpdb.org/)) to describe these structures. The format is *extensible*; extension dictionaries exist to describe NMR, SAS, EM data, etc. For integrative models, we use PDBx plus an "integrative/hybrid methods" (IHM) [extension dictionary](http://mmcif.wwpdb.org/dictionaries/mmcif_ihm.dic/Index/). This supports coarse-grained structures, multiple input experimental data sources, multiple states, multiple scales, and ensembles related by time or other order. (Similarly, purely computational models, such as homology models or those generated by AlphaFold, are annotated with information such as quality scores, or templates and alignments used, using the ModelCIF [extension dictionary](https://mmcif.wwpdb.org/dictionaries/mmcif_ma.dic/Index/).)

# Why can't we convert PDB/RMF directly to mmCIF? {#whynotconvert}

This generally isn't possible because PDB, RMF and ``IMP::Model`` are designed to store one or more output models, where each model is a set of coordinates for a single conformation of the system being studied. A deposition, on the other hand, aims to cover a complete **modeling study**, and should capture not just the entire ensemble of output models, but also all of the input data needed to reproduce the modeling, and quality metrics such as the precision of the ensemble and the degree to which it fits the input data. A deposition is also designed to be visualized, and so may contain additional data not used in the modeling itself, such as preset colors or views to match figures in the publication or highlight regions of interest, and more human-descriptive names for parts of the system. Thus, deposition is largely a data-gathering exercise, and benefits from a modeling study being tidy and well organized (for example, by storing it in a [GitHub](https://github.com/) repository) so that data is easy to find and track to its source.

# Generation of mmCIF files {#mmcifgen}

mmCIF is a text format with a well-defined syntax, so in principle files could be generated by hand or with simple scripts. However, it is generally easier to use the existing [python-ihm library](https://github.com/ihmwg/python-ihm). This stores the same data as in an mmCIF file, but represents it as a set of Python classes, so it is easier to manipulate.

For deposition, we could use the python-ihm library directly, by writing a Python script that reads in output models and input data, adds annotations, and writes out an mmCIF file. However, since in this case we used ``IMP.pmi`` to do the modeling, we can make use of a class in PMI called [ProtocolOutput](@ref IMP.pmi.mmcif.ProtocolOutput) that automatically captures an entire ``IMP.pmi`` modeling protocol.

# Basic usage of ProtocolOutput {#basicusage}

``~IMP.pmi.mmcif.ProtocolOutput`` is designed to be attached to a top-level PMI object (usually ``IMP.pmi.topology.System``). Then, as the script is run, it will capture all of the information IMP knows about the modeling study, in an ``ihm.System`` object. Additional information not in the modeling script itself, such as the resulting publication, can then be added using the [python-ihm API](https://python-ihm.readthedocs.io/en/latest/usage.html).

We now proceed by modifying the script from the previous modeling tutorial to attach a ProtocolOutput object and capture modeling protocol information as mmCIF.

The first modification is to import the PMI mmCIF and python-ihm Python modules: 

In [None]:
from __future__ import print_function

# Imports needed to use ProtocolOutput
import IMP.pmi.mmcif
import ihm

The script then proceeds as before until we have set up our top-level ``IMP.pmi.topology.System`` object:

In [None]:
import IMP
import IMP.core
import IMP.pmi.restraints.crosslinking
import IMP.pmi.restraints.stereochemistry
import IMP.pmi.tools

import IMP.pmi.macros
import IMP.pmi.topology

# Hot fixes correcting minor bugs in IMP 2.17.0
import tutorial_util

import os
import sys

# Needed below when restarting from a previous step's RMF file
import IMP.rmf
import RMF

import warnings
warnings.filterwarnings('ignore')

cryoEM=True

if cryoEM:
    step=1
    import IMP.bayesianem
    import IMP.bayesianem.restraint

try:
    import IMP.mpi
    print('ReplicaExchange: MPI was found. Using Parallel Replica Exchange')
    rex_obj = IMP.mpi.ReplicaExchange()
except ImportError:
    print('ReplicaExchange: Could not find MPI. Using Serial Replica Exchange')
    rex_obj = IMP.pmi.samplers._SerialReplicaExchange()

replica_number = rex_obj.get_my_index()

datadirectory = "../data/"
output_directory = "./output"

if not cryoEM:
    topology_file = datadirectory+"topology_poliii.txt" 
else:
    topology_file = datadirectory+"topology_poliii.cryoem.txt"
    
# Initialize IMP model
m = IMP.Model()

# Read in the topology file.  
# Specify the directory where the PDB files, FASTA files and GMM files are
topology = IMP.pmi.topology.TopologyReader(topology_file, 
                                  pdb_dir=datadirectory, 
                                  fasta_dir=datadirectory, 
                                  gmm_dir=datadirectory)

# Use the BuildSystem macro to build states from the topology file
bs = IMP.pmi.macros.BuildSystem(m)

Now we can attach a ProtocolOutput object (BuildSystem contains a `system` member):

In [None]:
# Record the modeling protocol to an mmCIF file
po = IMP.pmi.mmcif.ProtocolOutput()
bs.system.add_protocol_output(po)
po.system.title = "Modeling of RNA Pol III"
# Add publication
po.system.citations.append(ihm.Citation.from_pubmed_id(25161197))

Note that the `ProtocolOutput` object `po` simply wraps an `ihm.System` object as `po.system`. We can then customize the `ihm.System` by setting a human-readable title and adding a citation (here we use ``ihm.Citation.from_pubmed_id``, which looks up a citation by PubMed ID - this particular PubMed ID is actually for the previously-published [modeling of the Nup84 complex](https://salilab.org/nup84/)).

Now the original script proceeds as before, setting up the representation and restraints:

In [None]:
# Each state can be specified by a topology file.
bs.add_state(topology)

root_hier, dof = bs.execute_macro(max_rb_trans=4.0,
                                  max_rb_rot=0.3,
                                  max_bead_trans=4.0,
                                  max_srb_trans=4.0,
                                  max_srb_rot=0.3)

import IMP.pmi.plotting
import IMP.pmi.plotting.topology

IMP.pmi.plotting.topology.draw_component_composition(dof)

# Shuffle the rigid body and beads configuration for all molecules

# if you use XL only 
if not cryoEM:
    IMP.pmi.tools.shuffle_configuration(root_hier,
                                        max_translation=50, 
                                        verbose=False,
                                        cutoff=5.0,
                                        niterations=100)

# otherwise you randomize only if you start a new cryoEM-XL modeling
else:
    if step==1:
        # Shuffle the rigid body configuration of only the molecules we are interested in (Rpb4 and Rpb7)
        # but all flexible beads will also be shuffled.
        IMP.pmi.tools.shuffle_configuration(root_hier,
                                        max_translation=300,
                                        verbose=True,
                                        cutoff=5.0,
                                        niterations=100)
                                        #excluded_rigid_bodies=fixed_rbs,

    else:
        rh_ref = RMF.open_rmf_file_read_only('seed_%d.rmf3'%(step-1))
        IMP.rmf.link_hierarchies(rh_ref, [root_hier])
        IMP.rmf.load_frame(rh_ref, RMF.FrameID(replica_number))
        
outputobjects = [] # reporter objects...output is included in the stat file

# Connectivity keeps things connected along the backbone (ignores if inside same rigid body)
mols = IMP.pmi.tools.get_molecules(root_hier)
for mol in mols:
    molname=mol.get_name()
    IMP.pmi.tools.display_bonds(mol)
    cr = IMP.pmi.restraints.stereochemistry.ConnectivityRestraint(mol,scale=2.0)
    cr.add_to_model()
    cr.set_label(molname)
    outputobjects.append(cr)

ev = IMP.pmi.restraints.stereochemistry.ExcludedVolumeSphere(
                                         included_objects=root_hier,
                                         resolution=10)
ev.add_to_model()         # add to scoring function
outputobjects.append(ev)  # add to output

# We then initialize a CrossLinkDataBase that uses a keywords converter to map column to information.
# The required fields are the protein and residue number for each side of the crosslink.
xldbkwc = IMP.pmi.io.crosslink.CrossLinkDataBaseKeywordsConverter()
xldbkwc.set_protein1_key("Protein1")
xldbkwc.set_protein2_key("Protein2")
xldbkwc.set_residue1_key("AbsPos1")
xldbkwc.set_residue2_key("AbsPos2")
xldbkwc.set_id_score_key("ld-Score")

xl1 = IMP.pmi.io.crosslink.CrossLinkDataBase(xldbkwc)
xl1.create_set_from_file(datadirectory+'FerberKosinski2016_apo.csv')
xl1.set_name("APO")

xl2 = IMP.pmi.io.crosslink.CrossLinkDataBase(xldbkwc)
xl2.create_set_from_file(datadirectory+'FerberKosinski2016_DNA.csv')
xl2.set_name("DNA")

# Append the xl2 dataset to the xl1 dataset to create a larger dataset
xl1.append_database(xl2)

# Rename one protein name
xl1.rename_proteins({"ABC14.5":"ABC14_5"})

# Create 3 confidence classes
xl1.classify_crosslinks_by_score(3)

# Now, we set up the restraint.
xl1rest = IMP.pmi.restraints.crosslinking.CrossLinkingMassSpectrometryRestraint(
                                   root_hier=root_hier,  # The root hierarchy
                                   database=xl1,# The XLDB defined above
                                   length=21.0,          # Length of the linker in angstroms
                                   slope=0.002,          # A linear term that biases XLed
                                                         # residues together
                                   resolution=1.0,       # Resolution at which to apply the restraint. 
                                                         # Either 1 (residue) or 0 (atomic)
                                   label="XL",           # Used to label output in the stat file
                                   weight=10.)           # Weight applied to all crosslinks 
                                                         # in this dataset
xl1rest.add_to_model()
outputobjects.append(xl1rest)

# Set up the EM density restraint (cryo-EM runs only)

if cryoEM:
    target_gmm_file=datadirectory+'%d_imp.gmm'%(step)
    # First, get the model density objects that will be fitted to the EM density.
    densities = IMP.atom.Selection(root_hier, representation_type=IMP.atom.DENSITIES).get_selected_particles()
    gem = IMP.bayesianem.restraint.GaussianEMRestraintWrapper(densities,
                                                 target_fn=target_gmm_file,
                                                 scale_target_to_mass=True,
                                                 slope=0.01,
                                                 target_radii_scale=3.0,
                                                 target_is_rigid_body=False)

    gem.add_to_model()
    gem.set_label("Total")
    outputobjects.append(gem)

To save time, we can skip the actual sampling entirely (and use the previously-generated trajectory) by turning on ``~IMP.pmi.macros.ReplicaExchange0``'s `test_mode`:

In [None]:
# total number of saved frames
num_frames = 5

# This object defines all components to be sampled as well as the sampling protocol
mc1=IMP.pmi.macros.ReplicaExchange0(m,
              root_hier=root_hier,                         # The root hierarchy
              monte_carlo_sample_objects=dof.get_movers()+xl1rest.get_movers(), # All moving particles and parameters
              output_objects=outputobjects,                # Objects to put into the stat file
              rmf_output_objects=outputobjects,            # Objects to put into the rmf file
              monte_carlo_temperature=1.0,   
              replica_exchange_minimum_temperature=1.0,
              replica_exchange_maximum_temperature=2.5,              
              simulated_annealing=False,
              number_of_best_scoring_models=0,
              monte_carlo_steps=10,
              number_of_frames=num_frames,
              global_output_directory=output_directory,
              test_mode=True)

# Start Sampling
mc1.execute_macro()

Once we're done with our PMI protocol, we call the [ProtocolOutput.finalize](@ref IMP.pmi.mmcif.ProtocolOutput.finalize) method to collect all of the information about the integrative modeling protocol in ``ihm.System``:

In [None]:
po.finalize()

Note that the ``ihm.System`` object, stored as the `system` attribute of ``~IMP.pmi.mmcif.ProtocolOutput``, contains a full description of the system:

In [None]:
s = po.system

print("all subunits:",
      [a.details for a in s.asym_units])

print("first subunit sequence:",
      "".join(r.code for r in s.asym_units[0].entity.sequence))

print("all restraints on the system:", s.restraints)

Note that python-ihm only knows about two restraints, the crosslinking and the EM density restraint, not the connectivity or excluded volume. This is because PDB considers that all *valid* structures satisfy connectivity and excluded volume by definition.

# mmCIF file format {#mmcif}

At this point we have basic information which we can write out to an mmCIF file. Let's do that to get an idea of what the file looks like:

In [None]:
import ihm.dumper
with open('initial.cif', 'w') as fh:
    ihm.dumper.write(fh, [s])

mmCIF is a simple text format, so we can look at `initial.cif` in any text editor. Some simple data are stored in straightforward key:value pairs, e.g.

```
_struct.title 'Modeling of RNA Pol III'
```

More complex data can be stored as a table using a `loop` construct. This lists the fields (headings for the columns) after which the data (as rows) follow, e.g.

```
loop_
_software.pdbx_ordinal
_software.name
_software.classification
_software.description
_software.version
_software.type
_software.location
_software.citation_id
1 'IMP PMI module' 'integrative model building' 'integrative model building'
2.17.0 program https://integrativemodeling.org 2
2 'Integrative Modeling Platform (IMP)' 'integrative model building'
'integrative model building' 2.17.0 program https://integrativemodeling.org 3
3 MODELLER 'comparative modeling'
'Comparative modeling by satisfaction of spatial restraints, build 2013/11/13 18:10:39'
9.10 program https://salilab.org/modeller/ 4
4 MODELLER 'comparative modeling'
'Comparative modeling by satisfaction of spatial restraints, build 2013/11/15 19:51:28'
9.12 program https://salilab.org/modeller/ 4
```

Missing data are represented by `?` if they are unknown or `.` if they are deliberately omitted (for example if they don't make sense here).

In each case both the *categories* ([struct](http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v50.dic/Categories/struct.html), [software](http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v50.dic/Categories/software.html)) and the *data items* ([title](http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v50.dic/Items/_struct.title.html), [pdbx_ordinal](http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v50.dic/Items/_software.pdbx_ordinal.html), etc.) are defined by PDB, in the [PDBx dictionary](http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v50.dic/Index/). This is the same dictionary used for regular PDB entries (e.g. crystal structures). Categories prefixed with `ihm` are in the [IHM dictionary](http://mmcif.wwpdb.org/dictionaries/mmcif_ihm.dic/Index/), which is specific to integrative models.
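
To make the key-value and `loop_` layouts shown earlier in this section more concrete, the sketch below reads a single `loop_` table using only the Python standard library. It is an illustration rather than a replacement for a real mmCIF reader such as python-ihm: the helper name `parse_loop` is hypothetical, and it assumes shell-like quoting (via `shlex`), which covers the simple quoted strings shown above but not every quoting rule in the mmCIF syntax (multi-line text fields, for example, are not handled).

```python
import shlex


def parse_loop(text):
    """Parse one mmCIF ``loop_`` table into a list of {data item: value} dicts.

    Hypothetical helper for illustration only; assumes shell-like quoting.
    """
    keys = []
    tokens = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line == 'loop_':
            continue
        if line.startswith('_') and not tokens:
            # Column headings come first, one data item per line
            keys.append(line)
        else:
            # Values may be quoted, and a row may wrap over several lines
            tokens.extend(shlex.split(line))
    if not keys or len(tokens) % len(keys) != 0:
        raise ValueError("number of values is not a multiple "
                         "of the number of fields")
    return [dict(zip(keys, tokens[i:i + len(keys)]))
            for i in range(0, len(tokens), len(keys))]


# A shortened, made-up variant of the software table shown above
example = """
loop_
_software.pdbx_ordinal
_software.name
_software.version
_software.citation_id
1 'IMP PMI module'
2.17.0 2
2 MODELLER 9.12 ?
"""
rows = parse_loop(example)
for row in rows:
    print(row['_software.name'], row['_software.version'])
```

Note that the `?` in the last row is kept here as a literal string; a real reader would map `?` and `.` to distinct "unknown" and "omitted" values.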

# Linking to other data {#linking}

Integrative modeling draws on datasets from a variety of sources, so for a complete deposition all of these data need to be available. python-ihm provides an ``ihm.dataset`` module to describe various types of dataset, such as input experimental data, auxiliary output files (such as localization densities), or workflow files (such as Python scripts for modeling or visualization). For example, each restraint has an associated dataset:

In [None]:
print("restraint datasets:", [r.dataset for r in s.restraints])

The data are not placed directly in the mmCIF file - rather, the file contains links (using the ``ihm.location`` module). These links can be:

 - an identifier in a domain-specific database, such as PDB or EMDB.
 - a DOI where the files can be obtained.
 - a path to a file on the local disk.

**Database identifiers** are preferable because the databases are curated by domain experts and include domain-specific information, and the files are in standard formats. ProtocolOutput will attempt to use these where possible. When a file is used for the modeling which cannot be tracked back to a database, ProtocolOutput will include its path (relative to that of the mmCIF file). For example, in this case the cross-links used are stored in simple CSV files and are linked as such:

In [None]:
# Dataset for XL-MS restraint
d = s.restraints[0].dataset
print("XL-MS dataset at:", d.location.path)
print("Details:", d.location.details)

In addition, the Python script itself is linked from the mmCIF file. Such local paths won't be available to end users, so for deposition we need to replace these paths with database IDs or DOIs (more on that later).

Datasets all support a simple hierarchy, where one dataset can be derived from another. In this case the EM restraint uses a locally-available GMM file, but it is derived from a density map which is stored in EMDB. ProtocolOutput is able to read (using the [ihm.metadata module](@ref ihm.metadata)) metadata in both the GMM file and `.mrc` file to automatically determine this relationship:

In [None]:
# Dataset for EM restraint
d = s.restraints[1].dataset
print("GMM file at", d.location.path)
print("is derived from EMDB entry", d.parents[0].location.access_code)
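
The "derived from" relationship described above forms a simple tree that can be followed upward from any dataset to its original sources. The toy sketch below shows the idea with a stand-in class; `ToyDataset` and `primary_sources` are hypothetical names used only for illustration and are not part of python-ihm, whose own dataset objects carry a `parents` list as used in the cell above.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDataset:
    """Stand-in for a dataset record; NOT a python-ihm class."""
    name: str
    parents: list = field(default_factory=list)


def primary_sources(dataset):
    """Return the names of all parentless datasets reachable from `dataset`."""
    if not dataset.parents:
        return [dataset.name]
    found = []
    for parent in dataset.parents:
        for name in primary_sources(parent):
            if name not in found:
                found.append(name)
    return found


# Mirror the situation in this tutorial: a local GMM file derived
# from a density map deposited in EMDB
emdb_map = ToyDataset("EMDB density map")
gmm = ToyDataset("local GMM file", parents=[emdb_map])
print(primary_sources(gmm))
```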

As a further example of linkage, see the links in the previously-published [modeling of the Nup84 complex](https://salilab.org/nup84/) below. The mmCIF file links to the data directly used in the modeling (cross-links, crystal structures, electron microscopy class averages, comparative models, and Python scripts) via database IDs or DOIs. Furthermore, where available, links are provided from this often-processed data back to the original data, such as templates for comparative models, mass spectrometry spectra for cross-links, or micrographs for class averages:

<img src="images/links.png" width="700px" title="Nup84 file linkage" />

# Annotation of input files {#annotation}

ProtocolOutput, using python-ihm, will look at all input files to try to extract as much metadata as possible. As described above this is used to look up database identifiers, but it can also detect other inputs, such as the templates used for comparative modeling. Thus, it is important for deposition that all input files are annotated as well as possible:

 - deposit input files in a domain-specific database where possible and use the deposited file (which typically will contain identifying headers) in the modeling repository.
 - for PDB crystal structures, do not remove the original headers, such as the `HEADER` and `TITLE` lines.
 - for MODELLER comparative models, leave in the REMARK records and make sure that any files mentioned in `REMARK 6 ALIGNMENT:` or `REMARK 6 SCRIPT:` records are available (modify the paths if necessary, for example if you moved the PDB file into a different directory from the modeling script and alignment).
 - for manually generated PDB files, such as those extracted from a published work or generated by docking or other means, add suitable `EXPDTA` and `TITLE` records to the files for ProtocolOutput to pick up. See the [python-ihm docs](@ref ihm.metadata.PDBParser.parse_file) for more information.
 - for GMM files used for the EM density restraint, keep the original MRC file around and make sure that the `# data_fn:` header in the GMM file points to it.
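
As a quick sanity check before modeling, it can help to confirm that the PDB-format inputs still carry the records mentioned in the list above. The sketch below only tests for the presence of record names, using the standard library; the function name `missing_pdb_records` and the default list of records are assumptions made for this example, and exactly which records python-ihm inspects is determined by its own parser, not by this check.

```python
def missing_pdb_records(lines, required=('HEADER', 'TITLE', 'EXPDTA')):
    """Return the record names from `required` that never appear in `lines`."""
    seen = set()
    for line in lines:
        # The record name occupies the first six columns of a PDB line
        seen.add(line[:6].strip())
    return [name for name in required if name not in seen]


# Made-up lines standing in for the contents of an input PDB file
pdb_lines = [
    "HEADER    TRANSCRIPTION",
    "TITLE     EXAMPLE STRUCTURE",
    "ATOM      1  N   MET A   1      11.104   6.134  -6.504  1.00  0.00",
]
print(missing_pdb_records(pdb_lines))
```

For a real file, the same function can be applied to an open file handle, since it only needs an iterable of lines.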


# Polishing the deposition {#polishing}

ProtocolOutput attempts to automatically generate as much as possible of the mmCIF file, but there are some areas where manual intervention is necessary because the data is missing, or the guess was incorrect. This data can be
corrected by manipulating the ``ihm.System`` object directly. We will look at a few examples in this section.

## Cross-linker type {#xltype}

For cross-linking experiments, the mmCIF file contains a description of the cross-linking reagent used (e.g. the name of the chemical and the structure as a SMILES string). This information is not in the CSV file or the Python script. ProtocolOutput guesses the name of the reagent using the `label` passed to the PMI ``~IMP.pmi.restraints.crosslinking.CrossLinkingMassSpectrometryRestraint``,
as this label is often the name of the cross-linker. However, in this case it is not (we used the label `XL`).

We can correct this by looking up [the publication](https://doi.org/10.1038/nmeth.3838) to determine
that the [DSS](https://en.wikipedia.org/wiki/Disuccinimidyl_suberate) cross-linker was used. DSS is a common enough cross-linker that python-ihm already includes a definition for it in the ``ihm.cross_linkers`` module (for less common linkers we can create an ``ihm.ChemDescriptor`` object from scratch to describe its chemistry). Then we just set the linker type for each [cross-linking restraint](@ref ihm.restraint.CrossLinkRestraint) in the [list of all restraints](@ref ihm.System.restraints):


In [None]:
# Definitions of some common crosslinkers
import ihm.cross_linkers

# There should be exactly one XL restraint on the system
xl, = [r for r in s.restraints if isinstance(r, ihm.restraint.CrossLinkRestraint)]
xl.linker = ihm.cross_linkers.dss

## Correct number of output models {#fixnummodel}

ProtocolOutput captures information on the actual sampling, storing it in ``ihm.protocol.Step`` objects. Here it correctly notes that we ran Monte Carlo to generate 5 frames:

In [None]:
# Get last step of last protocol (protocol is an 'orphan' because
# we haven't used it for a model yet)
last_step = s.orphan_protocols[-1].steps[-1]
print(last_step.num_models_end)

However, in many modeling scenarios the modeling script is run multiple times on a compute cluster to generate several independent trajectories, which are then combined. ProtocolOutput cannot know whether this happened, but it is straightforward to use the python-ihm API to manually change the number of output models to match that reported in the publication:

In [None]:
# Correct number of output models to account for multiple runs
last_step.num_models_end = 200000

## Add model coordinates {#addcoords}

The current mmCIF file contains all of the input data, and notes that Monte Carlo was used to generate frames, but doesn't actually store any coordinates yet! We need to extract this information from clustering or other postprocessing that was done as part of our analysis. PMI's analysis and validation pipeline is still in development so doesn't yet provide this information automatically in the same way that ProtocolOutput does for the modeling step, so for now we add any information about clustering, localization densities, and final models to the file using the python-ihm API.

First, we write a simple function to get the names of the RMF files that constitute a cluster, by parsing output files from the analysis pipeline, and use it to get all structures for the largest cluster (`cluster.0`):

In [None]:
import os

def get_cluster_members(analysis_dir, cluster_num):
    """Get filenames of RMF files in the given cluster"""
    num_models = 0
    for sample in ('A', 'B'):
        file_for_id = {}
        with open(os.path.join(analysis_dir,
                               'Identities_%s.txt' % sample)) as fh:
            for line in fh:
                filename, model_id = line.rstrip('\r\n').split()
                file_for_id[model_id] = filename
        with open(os.path.join(analysis_dir,
                               'cluster.%d.sample_%s.txt' % (cluster_num, sample))) as fh:
            for line in fh:
                model_id = line.rstrip('\r\n')
                num_models += 1
                yield os.path.join(analysis_dir, file_for_id[model_id])
                # In the interests of speed and space, let's just get
                # 10 models from each sample
                if num_models % 10 == 0:
                    break
all_cluster_0_models = list(get_cluster_members('../analysis', 0))

Now that we know how many structures are in that cluster, we can add information about the clustering to the modeling protocol. We do this by using classes in the ``ihm.analysis`` module:

In [None]:
# Get last protocol in the file
protocol = s.orphan_protocols[-1]
# State that we filtered the 200000 frames down to one cluster:
import ihm.analysis
analysis = ihm.analysis.Analysis()
protocol.analyses.append(analysis)
analysis.steps.append(ihm.analysis.ClusterStep(
                      feature='RMSD', num_models_begin=200000,
                      num_models_end=len(all_cluster_0_models)))

python-ihm allows for models to be placed into groups (``ihm.model.ModelGroup``) so we can make such a group for the cluster. These groups are in turn grouped in an ``ihm.model.State``, a distinct state of the system (e.g. open/closed form, or reactant/product in a chemical reaction). Finally, states can themselves be grouped into ``ihm.model.StateGroup`` objects. All of these groups act like regular Python lists. In this case we have only one state, so we just add our model group to that:

In [None]:
import ihm.model

mg = ihm.model.ModelGroup(name="Cluster 0")

# Add to last state
s.state_groups[-1][-1].append(mg)

To add a model to the file, we create an ``ihm.model.Model`` object, add atoms or coarse-grained objects to it, and then add that to the previously-created ``~ihm.model.ModelGroup``. ProtocolOutput provides a convenience function `add_model` which does this, converting the current IMP structure to python-ihm (we just need to load an IMP structure first, e.g. from an RMF file using ``IMP.rmf.load_frame``).

We want to store more than just a single model in the file, since a single model cannot represent the flexibility and variability of the complete ensemble. mmCIF files allow for storing multiple conformations (like a traditional PDB file for storing an NMR ensemble) and also representations of the entire ensemble (localization densities).

However, mmCIF is a text format and is not best suited for storing large numbers of structures. In this case we can store the coordinates in a simpler, binary file and link to it from the mmCIF using the ``ihm.location.OutputFileLocation`` class, providing one or more representative structures (such as the cluster centroid) in the mmCIF file itself. [DCD](https://www.ks.uiuc.edu/Research/namd/2.9/ug/node11.html) is one such format that we can (ab)use for this purpose (it is really designed for atomic trajectories). python-ihm provides an ``ihm.model.DCDWriter`` class which, given an ``ihm.model.Model``, will output DCD. Combining these, we can generate a DCD file for all models in the largest cluster:

In [None]:
import RMF

# Make DCD of all models in cluster
dcd_filename = '../analysis/cluster.0/allmodels.dcd'
num_models = 0
with open(dcd_filename, 'wb') as dcd_fh:
    dcd = ihm.model.DCDWriter(dcd_fh)
    for model_file in all_cluster_0_models:
        rh = RMF.open_rmf_file_read_only(model_file)
        IMP.rmf.link_hierarchies(rh, [root_hier])
        IMP.rmf.load_frame(rh, RMF.FrameID(0))
        del rh
        m = po.add_model(mg)
        dcd.add_model(m)
        # We only want the model in DCD, not mmCIF
        del mg[-1]

dcd_location = ihm.location.OutputFileLocation(
    path=dcd_filename,
    details="Coordinates of all structures in the largest cluster")

Next, we can describe the cluster using the ``ihm.model.Ensemble`` class, noting that it was generated from our clustering, and contains structures stored in the DCD file:

In [None]:
e = ihm.model.Ensemble(model_group=mg,
                       num_models=len(all_cluster_0_models),
                       post_process=analysis.steps[-1],
                       name="Cluster 0", file=dcd_location)
s.ensembles.append(e)

Finally, we can add coordinates for the cluster centroid to the mmCIF file itself, in a similar fashion.

In [None]:
# Add the cluster center model from RMF
rh = RMF.open_rmf_file_read_only('../analysis/cluster.0/cluster_center_model.rmf3')
IMP.rmf.link_hierarchies(rh, [root_hier])
IMP.rmf.load_frame(rh, RMF.FrameID(0))
del rh
m = po.add_model(mg)

We can also add information about the fit of the model to each of the restraints. Most python-ihm restraint objects contain a `fits` member which is a Python dict with ``ihm.model.Model`` objects as keys and some sort of restraint-fit object as values. For example, we can note that the cluster centroid model `m` was fit against the EM map by adding an ``ihm.restraint.EM3DRestraintFit`` object. This object allows us to give the cross correlation coefficient (if known, otherwise we can use the Python value `None`). We could also specify the Bayesian nuisances ψ and σ for our cross-links with ``ihm.restraint.CrossLinkFit`` objects.

In [None]:
em, = [r for r in s.restraints if isinstance(r, ihm.restraint.EM3DRestraint)]
em.fits[m] = ihm.restraint.EM3DRestraintFit(cross_correlation_coefficient=None)

If we have localization densities from the analysis for the cluster, we can add those to the Ensemble as ``ihm.model.LocalizationDensity`` objects. These take python-ihm subunits (or subsets of them) as ``ihm.AsymUnit`` objects. ProtocolOutput provides an `asym_units` member, which is a Python dict, to get python-ihm subunits given PMI component names (such as "C31.0"):

In [None]:
for comp in ('C31', 'C34', 'C82', 'C53', 'C37'):
    asym = po.asym_units['%s.0' % comp]
    loc = ihm.location.OutputFileLocation('../analysis/cluster.0/LPD_%s.mrc' % comp)
    den = ihm.model.LocalizationDensity(file=loc, asym_unit=asym)
    e.densities.append(den)

## Replace local links with DOIs {#adddois}

We added paths for localization densities and the DCD file with ``ihm.location.OutputFileLocation`` objects (a type of `FileLocation`). These files won't be accessible to end users since they live on the local disk. ([Recall from earlier](#linking) that ProtocolOutput also adds local paths for the cross-link CSV files.)

`FileLocation` takes an optional `repo` argument which, if given, is an ``ihm.location.Repository`` describing the DOI where the files can be found, generally as a `zip` file (the first argument to ``ihm.location.FileLocation`` is a path within the repository if `repo` is specified, otherwise a path on the local disk). So we could explicitly refer to a DOI for each of our external files. (A number of services provide DOIs for uploaded files, such as [Zenodo](https://zenodo.org/) or [FigShare](https://figshare.com/).)

An alternative is to retroactively change the existing `FileLocation`s (both those created automatically by ProtocolOutput and manually by us using python-ihm) using the `root` and `top_directory` arguments to ``~ihm.location.Repository`` and the ``ihm.System.update_locations_in_repositories`` function:

In [None]:
datarepo = ihm.location.Repository(doi="10.5281/zenodo.3526621",
            root="../data", top_directory='data',
            url="https://zenodo.org/record/3526621/files/data.zip")
clusterrepo = ihm.location.Repository(doi="10.5281/zenodo.3526621",
            root="../analysis/cluster.0", top_directory='cluster.0',
            url="https://zenodo.org/record/3526621/files/cluster.0.zip")
s.update_locations_in_repositories([datarepo, clusterrepo])

Essentially, this rewrites every `FileLocation` that points to a file under `root` (`../data`) so that it instead points into the `top_directory` (`data`) subdirectory of the archive identified by `doi` and `url` (`data.zip` at the DOI [10.5281/zenodo.3526621](https://doi.org/10.5281/zenodo.3526621)). (A mandatory DOI is provided for permanence; an optional URL is provided to make it easier for software such as ChimeraX to download the file.) For example, the local file `../data/foo/bar` would be found as `data/foo/bar` inside the zip file. A similar rewrite is done for the files of the largest cluster in `../analysis/cluster.0/`. We can now see that our input experimental information and output files point to DOIs:

In [None]:
# Dataset for XL-MS restraint
d = s.restraints[0].dataset
print("XL-MS dataset at %s/%s inside %s"
      % (d.location.repo.top_directory,
         d.location.path, d.location.repo.url))

# First localization density
d = e.densities[0]
print("Localization density at %s/%s inside %s"
      % (d.file.repo.top_directory,
         d.file.path, d.file.repo.url))

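The path rewriting described above can be illustrated with a small standalone sketch. This is not python-ihm's implementation: the `rewrite_location` helper and the plain dicts standing in for ``ihm.location.Repository`` objects are hypothetical, and the sketch only assumes the rule stated in the text (a file under `root` is found under `top_directory` inside the archive).

```python
import posixpath


def rewrite_location(local_path, repos):
    """Return (repo, path inside archive), or (None, local_path) if no match.

    `repos` is a list of dicts with the hypothetical keys 'root',
    'top_directory' and 'url'; they only mimic Repository objects.
    """
    norm = posixpath.normpath(local_path)
    for repo in repos:
        root = posixpath.normpath(repo['root'])
        # Only rewrite files that live under this repository's root
        if norm.startswith(root + '/'):
            rel = norm[len(root) + 1:]
            return repo, posixpath.join(repo['top_directory'], rel)
    return None, local_path


repos = [
    {'root': '../data', 'top_directory': 'data',
     'url': 'https://zenodo.org/record/3526621/files/data.zip'},
    {'root': '../analysis/cluster.0', 'top_directory': 'cluster.0',
     'url': 'https://zenodo.org/record/3526621/files/cluster.0.zip'},
]

repo, path = rewrite_location('../data/foo/bar', repos)
print("%s inside %s" % (path, repo['url']))
```

Paths that do not fall under any `root` are returned unchanged in this sketch, so they would remain local paths.
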
# Output {#output}

We can then save the entire protocol to an mmCIF file (or to [BinaryCIF](https://github.com/dsehnal/BinaryCIF), if we have the Python [msgpack](https://pypi.org/project/msgpack/) package installed) using the ``ihm.dumper.write`` function:

In [None]:
import ihm.dumper
with open('rnapoliii.cif', 'w') as fh:
    ihm.dumper.write(fh, [s])

# Alternatively, write BinaryCIF instead (requires the msgpack package):
#with open('rnapoliii.bcif', 'wb') as fh:
#    ihm.dumper.write(fh, [s], format='BCIF')

(BinaryCIF stores all the same data as mmCIF, but rather than storing it as a text file, the data is compressed and then written out as a [MessagePack](https://msgpack.org/index.html) file, which is a form of binary JSON.)

# Visualization {#visualization}

## ChimeraX {#chimerax}

mmCIF files can be viewed in many viewers. However, most viewers do not yet support the integrative modeling extensions, and so may only show the atomic parts of the model (if any). Integrative models can be viewed in [ChimeraX](https://www.rbvi.ucsf.edu/chimerax/). Be sure to use a recent nightly build, and open the file from the ChimeraX command line using the `format ihm` option, e.g. `open rnapoliii.cif format ihm`. (If you also want to see the DCD file, add `ensembles true` to the end of the command to load it, and then use the ChimeraX `coordset` command to manipulate the set of coordinates, e.g. `coordset slider #1.3.2`.)

ChimeraX also supports downloading and displaying structures directly from the [PDB-Dev](https://pdb-dev.wwpdb.org/) database, for example by typing `open 10 from pdbdev` at the ChimeraX command line.

> Note that even though PDB-Dev is still quite small, the models vary widely in composition, e.g.
>
> - Some models (e.g. PDBDEV_00000018) are atomic and look much like a traditional "PDB" X-ray structure.
> - Some models (e.g. PDBDEV_00000002) contain multiple states that have different sequences (this particular case contains the exosome in nucleus-localized and cytoplasm-localized forms).
> - Some models are not of proteins at all - e.g. PDBDEV_00000008 is a model of chromatin, part of the fly genome.
> - Some models contain multiple representations - e.g. PDBDEV_00000012, a model of the yeast nuclear pore complex, contains both models of the scaffold (ring) and the flexible FG repeat regions which occupy the center of the pore.

The [VMD](http://www.ks.uiuc.edu/Research/vmd/) developers are also reportedly working on support for the forthcoming 1.9.4 release.

An example view of the deposition, as rendered by ChimeraX, is shown below. The EM map is shown as a mesh, and the DSS cross-links as green dashed lines. ChimeraX also shows the comparative models used as starting guesses for the subunit structures - toggle off the display of the result models to see them.

<img src="images/chimerax.png" width="700px" title="ChimeraX visualization of rnapoliii mmCIF file" />

## Web browser {#browser}

mmCIF files can also be viewed in a web browser, using [Mol* Viewer](https://molstar.org/viewer). (Note that it currently only shows the coordinates, not the experimental information or other metadata.) An example rendering is shown below.

<img src="images/molstar.png" width="700px" title="Molstar visualization of rnapoliii mmCIF file" />

## Plain text {#plain}

mmCIF files are also just plain text files, and can be viewed in any text
editor (for example to check for errors, particularly for categories that
ChimeraX doesn't support yet). Most of the data are stored as simple tables
(look for the `loop_` keyword in the file). For example, the coordinates of
the coarse-grained beads are stored in the [_ihm_sphere_obj_site table](http://mmcif.wwpdb.org/dictionaries/mmcif_ihm.dic/Categories/ihm_sphere_obj_site.html),
the start of which in `rnapoliii.cif` looks like:

```
loop_
_ihm_sphere_obj_site.id
_ihm_sphere_obj_site.entity_id
_ihm_sphere_obj_site.seq_id_begin
_ihm_sphere_obj_site.seq_id_end
_ihm_sphere_obj_site.asym_id
_ihm_sphere_obj_site.Cartn_x
_ihm_sphere_obj_site.Cartn_y
_ihm_sphere_obj_site.Cartn_z
_ihm_sphere_obj_site.object_radius
_ihm_sphere_obj_site.rmsf
_ihm_sphere_obj_site.model_id
1 1 1 10 A 124.581 133.035 33.707 5.992 . 1
2 1 11 20 A 131.316 145.472 45.948 5.992 . 1
3 1 21 30 A 137.311 152.343 61.883 5.992 . 1
4 1 31 40 A 134.626 152.521 80.483 5.992 . 1
```

This simply states that a sphere representing residues 1 through 10 in chain A is centered
at (124.581, 133.035, 33.707) and has radius 5.992, and so on.

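To make the table layout concrete, here is a minimal sketch of reading such a `loop_` table with plain Python. The `parse_simple_loop` helper is hypothetical and deliberately naive: it assumes one row per line and bare whitespace-separated values, so it will not handle quoted strings or multi-line values. For real files, use the ``ihm.reader`` module described in the next section.

```python
def parse_simple_loop(text):
    """Parse a single mmCIF loop_ table into a list of dicts.

    Simplifying assumption: every value is a bare whitespace-separated
    token and each row sits on its own line.
    """
    keywords, rows = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line or line == 'loop_':
            continue
        if line.startswith('_'):
            # Column name, e.g. _ihm_sphere_obj_site.Cartn_x -> Cartn_x
            keywords.append(line.split('.', 1)[1])
        else:
            rows.append(dict(zip(keywords, line.split())))
    return rows


table = """
loop_
_ihm_sphere_obj_site.id
_ihm_sphere_obj_site.entity_id
_ihm_sphere_obj_site.seq_id_begin
_ihm_sphere_obj_site.seq_id_end
_ihm_sphere_obj_site.asym_id
_ihm_sphere_obj_site.Cartn_x
_ihm_sphere_obj_site.Cartn_y
_ihm_sphere_obj_site.Cartn_z
_ihm_sphere_obj_site.object_radius
_ihm_sphere_obj_site.rmsf
_ihm_sphere_obj_site.model_id
1 1 1 10 A 124.581 133.035 33.707 5.992 . 1
2 1 11 20 A 131.316 145.472 45.948 5.992 . 1
"""

spheres = parse_simple_loop(table)
first = spheres[0]
print("Residues %s-%s of chain %s, radius %s"
      % (first['seq_id_begin'], first['seq_id_end'],
         first['asym_id'], first['object_radius']))
```

All values are kept as strings here. The `.` in the `rmsf` column is the mmCIF marker for an omitted value, so it cannot simply be converted to a number.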

## As Python objects {#python}

We can also read in mmCIF or BinaryCIF files using python-ihm's ``ihm.reader`` module and explore the data using the python-ihm API. For example, we can read the structure we just created:

In [None]:
import ihm.reader
with open('rnapoliii.cif') as fh:
    s, = ihm.reader.read(fh)
print(s.title, s.restraints, s.ensembles, s.state_groups)

Similarly, we could explore any integrative model deposited in PDB-Dev. For example, we can look at PDB-Dev \#14, a [HADDOCK](http://www.bonvinlab.org/software/haddock2.2/) model:

In [None]:
import urllib.request
with urllib.request.urlopen('https://pdb-dev.wwpdb.org/static/cif/PDBDEV_00000014.cif') as fh:
    s, = ihm.reader.read(fh)
print(s.title, s.restraints, s.ensembles, s.state_groups)

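To look at other entries, the URL can be built from the entry number. The helper below is a hypothetical convenience, not part of python-ihm. It assumes that every entry follows the same pattern as the single URL used above (`PDBDEV_` followed by the number zero-padded to 8 digits), which may not hold if the site layout changes.

```python
def pdbdev_cif_url(entry_number):
    """Build the mmCIF URL for a PDB-Dev entry number, e.g. 14."""
    # Assumed pattern: 'PDBDEV_' plus the number zero-padded to 8 digits
    entry_id = 'PDBDEV_%08d' % entry_number
    return 'https://pdb-dev.wwpdb.org/static/cif/%s.cif' % entry_id


print(pdbdev_cif_url(14))
```
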
# Further reading {#further}

More tutorials on using IMP are available at [the IMP web site](https://integrativemodeling.org/tutorials/).