# Topology Related Functions and Attributes

In this tutorial, we show how to generate the topology for chains and how to work with the objects associated with it.

In [1]:
from crimm.Modeller import TopologyLoader, ParameterLoader
from crimm.Modeller.TopoFixer import fix_chain, ResidueFixer
from crimm.Visualization import View, show_nglview_multiple
from crimm.Fetchers import fetch_rcsb



In [2]:
structure = fetch_rcsb('1DFU', include_solvent=False) 
# The warnings are expected, since all connection records involving water are skipped



In [3]:
model = structure.models[0]
model

NGLWidget()

<Model id=1 Chains=8>
	│
	├───<Polyribonucleotide id=A Residues=19>
	├──────Description: 5S RRNA
	│
	├───<Polyribonucleotide id=B Residues=19>
	├──────Description: 5S RRNA
	│
	├───<Polypeptide(L) id=C Residues=94>
	├──────Description: RIBOSOMAL PROTEIN L25
	│
	├───<Heterogens id=D Molecules=1>
	├──────Description: MAGNESIUM ION
	│
	├───<Heterogens id=E Molecules=1>
	├──────Description: MAGNESIUM ION
	│
	├───<Heterogens id=F Molecules=1>
	├──────Description: MAGNESIUM ION
	│
	├───<Heterogens id=G Molecules=1>
	├──────Description: MAGNESIUM ION
	│
	├───<Heterogens id=H Molecules=1>
	├──────Description: MAGNESIUM ION


In [4]:
# quick check for any fragmented chains
for chain in model:
    if not chain.is_continuous:
        print(chain)
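
How `is_continuous` is computed is not shown in this tutorial. One plausible way to detect a fragmented chain is to look for jumps in the residue sequence numbers. The sketch below uses a hypothetical helper, `find_gaps`, that is not part of crimm, and the residue numbers are made up for illustration.

```python
def find_gaps(resseqs):
    """Return (previous, next) pairs where consecutive residue numbers skip."""
    ordered = sorted(resseqs)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > 1]


# Hypothetical residue numbers in which residues 4 and 5 are unresolved
gaps = find_gaps([1, 2, 3, 6, 7])
print(gaps)  # [(3, 6)]
```

A chain with an empty gap list would count as continuous under this assumption. A real implementation may compare against the reference sequence or check bond distances instead.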

## Load Topology and Parameter

The example below is a temporary workflow for generating the topology and parameters. The process will be streamlined very soon.
**NOTE**: DNA chains are not supported yet but will be supported in the future.

In [5]:
# get protein definitions
topo_p = TopologyLoader('protein')
param_p = ParameterLoader('protein')
param_p.fill_ic(topo_p)

# get RNA definitions
topo_r = TopologyLoader('nucleic')
param_r = ParameterLoader('nucleic')
param_r.fill_ic(topo_r)

The main method of `TopologyLoader` is `generate_chain_topology()`, where the terminal patches are specified. The `coerce` flag applies the canonical residue topology definitions to any modified residues, so that they can be converted back to the canonical form later when the residues are fixed.

In [6]:
protein_chains = []
rna_chains = []
for chain in model:
    if chain.chain_type == 'Polypeptide(L)':
        topo_p.generate_chain_topology(
            chain, first_patch='ACE', last_patch='CT3', coerce=True
        )
        protein_chains.append(chain)
    elif chain.chain_type == 'Polyribonucleotide':
        topo_r.generate_chain_topology(chain)
        rna_chains.append(chain)

# fill ic again since we have generated patched residue definitions
param_p.fill_ic(topo_p)
param_r.fill_ic(topo_r)



In [7]:
protein_chains[0]

NGLWidget()

<Polypeptide(L) id=C Residues=94>
  Description: RIBOSOMAL PROTEIN L25


## Inspecting Individual Residues

In [8]:
first_res = protein_chains[0].residues[0]
first_res

NGLWidget()

<Residue MET het=  resseq=1 icode= >


The `missing_atoms` attribute shows which atoms are missing compared with the residue topology definition loaded onto the residue. In the case below, the missing atoms to be built correspond to the N-terminal acetylation that we specified with `first_patch='ACE'` when generating the topology.

`+N` and `+CA` refer to missing neighbor atoms. Since this is the first residue in the chain, these entries serve only as placeholders. Neighbor atoms (names starting with '+' or '-') are not built from the current residue definition, but they are built in their owning residues if those exist.

In [9]:
first_res.missing_atoms

{'CAY': <MissingAtom CAY>,
 'CY': <MissingAtom CY>,
 'OY': <MissingAtom OY>,
 '+N': None,
 '+CA': None}
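
Conceptually, the missing atoms are the difference between the atom names in the residue definition and the atom names present in the structure. The sketch below is a minimal illustration of that idea, not crimm's implementation. The helper `split_missing` is hypothetical, and the atom list is a shortened, illustrative subset of an ACE-patched MET definition.

```python
def split_missing(defined_atoms, present_atoms):
    """Separate missing atoms into the residue's own atoms and neighbor placeholders."""
    missing = [name for name in defined_atoms if name not in present_atoms]
    own = [name for name in missing if name[0] not in '+-']
    neighbors = [name for name in missing if name[0] in '+-']
    return own, neighbors


# Illustrative subset of atom names; '+' marks atoms of the next residue
defined = ['CAY', 'CY', 'OY', 'N', 'CA', 'C', 'O', '+N', '+CA']
present = {'N', 'CA', 'C', 'O'}

own, neighbors = split_missing(defined, present)
print(own)        # ['CAY', 'CY', 'OY']
print(neighbors)  # ['+N', '+CA']
```

Only the atoms in `own` would be candidates for building in this residue. The neighbor entries mirror the `None` placeholders in the output above.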

Hydrogen atoms can also be built from the residue topology definitions.

In [10]:
first_res.missing_hydrogens

{'HN': <MissingAtom HN>,
 'HA': <MissingAtom HA>,
 'HB1': <MissingAtom HB1>,
 'HB2': <MissingAtom HB2>,
 'HG1': <MissingAtom HG1>,
 'HG2': <MissingAtom HG2>,
 'HE1': <MissingAtom HE1>,
 'HE2': <MissingAtom HE2>,
 'HE3': <MissingAtom HE3>,
 'HY1': <MissingAtom HY1>,
 'HY2': <MissingAtom HY2>,
 'HY3': <MissingAtom HY3>}

## Topology Elements

By topology elements, we mean the geometric elements of the topology, such as bonds, angles, dihedrals, and impropers. They all have direct object handles in the `TopologyElementContainer`.

**Note**: cmap (cross-term correction map) has not been fully implemented. The calculation of angle values for dihedrals and impropers has not been implemented either, and these values are shown as 0.00.

In [11]:
prot_chain = protein_chains[0]
prot_chain.topo_elements

<TopologyElementContainer for <Polypeptide(L) id=C Residues=94> with bonds=1555, angles=2827, dihedrals=4114, impropers=270, cmap=0>

The elements can be accessed through the attributes `bonds`, `angles`, `dihedrals`, and `impropers`. Since they are enumerated from the molecular graph using the topology definition, we know which bonds, angles, dihedrals, etc. are supposed to be present. If an atom is missing from the residue, it is colored red in the notebook output. In the outputs below, elements that involve a missing atom also show no computed length or angle value.

In [12]:
prot_chain.topo_elements.bonds[:5]

[<Bond(  CB,   CA) type=single length=1.52>,
 <Bond(  CG,   CB) type=single length=1.51>,
 <Bond(  SD,   CG) type=single length=1.78>,
 <Bond(  CE,   SD) type=single length=1.81>,
 <Bond(   N,   HN) type=single>]

In [13]:
angles10 = prot_chain.topo_elements.angles[:10]
angles10

[<Angle( HB1,   CB,  HB3)>,
 <Angle(  CB,   CG,  HG2)>,
 <Angle(  CA,    C,    O) angle=120.68>,
 <Angle(  HB,   CB,  CG2)>,
 <Angle(HG23,  CG2, HG21)>,
 <Angle(  CB,  CG2, HG23)>,
 <Angle(  CE,   CD,  HD2)>,
 <Angle(  CB,  CG1, HG12)>,
 <Angle(  CA,    C,    O) angle=120.41>,
 <Angle(  CD,   CE,  HE2)>]
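
To illustrate what "enumerated from the molecular graph" can mean, the sketch below derives angles from a bond list: every pair of neighbors that share a central atom forms one angle. This is a generic graph construction under our own assumptions, not crimm's code. The function name `enumerate_angles` is hypothetical, and the bonds are the four heavy-atom bonds printed in the bond output above.

```python
from collections import defaultdict
from itertools import combinations


def enumerate_angles(bonds):
    """List (a, center, c) triples for every pair of atoms bonded to the same center."""
    neighbors = defaultdict(set)
    for a, b in bonds:
        neighbors[a].add(b)
        neighbors[b].add(a)
    angles = []
    for center in sorted(neighbors):
        for a, c in combinations(sorted(neighbors[center]), 2):
            angles.append((a, center, c))
    return angles


# Heavy-atom bonds of the methionine side chain from the bond listing
bonds = [('CB', 'CA'), ('CG', 'CB'), ('SD', 'CG'), ('CE', 'SD')]
angles = enumerate_angles(bonds)
print(angles)
# [('CA', 'CB', 'CG'), ('CB', 'CG', 'SD'), ('CE', 'SD', 'CG')]
```

Dihedrals can be enumerated the same way by extending each bond with one neighbor on either side. Because the enumeration starts from the definition rather than from the coordinates, elements that involve missing atoms are still listed.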

The atoms in an element can be accessed directly by unpacking it.

In [14]:
angle = angles10[0]
print(angle)
a1, a2, a3 = angle
a1.get_full_id()

<Angle( HB1,   CB,  HB3)>


('1DFU', 1, 'C', (' ', 28, ' '), ('HB1', ' '))

In [15]:
# The residue itself can be accessed
a1.parent

NGLWidget()

<Residue ALA het=  resseq=28 icode= >


## Residue Fixer

The `ResidueFixer` class is designed to build missing atoms in residues. A residue is fixed based on the residue topology definition (a `ResidueDefinition` object) loaded onto it. The `ResidueFixer` class has four main methods for fixing a residue:
1. `build_missing_atoms()` for building any missing heavy atoms
2. `build_hydrogens()` for building the **missing** hydrogens only
3. `rebuild_hydrogens()` to remove all hydrogens on the residue and rebuild them based on the residue topology definition
4. `remove_undefined_atoms()` to remove any atoms that are not in the definition

In [16]:
fixer = ResidueFixer()
fixer.load_residue(first_res)
built_atoms = fixer.build_missing_atoms()
built_hydrogens = fixer.build_hydrogens()

In [17]:
first_res

NGLWidget()

<Residue MET het=  resseq=1 icode= >


In [18]:
built_atoms

[<Atom CY>, <Atom CAY>, <Atom OY>]

In [19]:
built_hydrogens

[<Atom HN>,
 <Atom HA>,
 <Atom HB1>,
 <Atom HB2>,
 <Atom HG1>,
 <Atom HG2>,
 <Atom HE1>,
 <Atom HE2>,
 <Atom HE3>,
 <Atom HY1>,
 <Atom HY2>,
 <Atom HY3>]
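
The same two calls can be repeated for every residue of interest. The sketch below wraps them in a loop, using only the `load_residue()`, `build_missing_atoms()`, and `build_hydrogens()` calls shown above. The helper `fix_residues` and the `RecordingFixer` stand-in are hypothetical; the stand-in exists only so that the sketch runs without crimm installed.

```python
def fix_residues(residues, fixer):
    """Run the heavy-atom builder, then the hydrogen builder, on each residue."""
    report = []
    for residue in residues:
        fixer.load_residue(residue)
        heavy = fixer.build_missing_atoms()
        hydrogens = fixer.build_hydrogens()
        report.append((residue, heavy, hydrogens))
    return report


class RecordingFixer:
    """Stand-in with the same three method names; it only records the calls."""

    def __init__(self):
        self.calls = []

    def load_residue(self, residue):
        self.calls.append(('load', residue))

    def build_missing_atoms(self):
        self.calls.append(('heavy',))
        return []

    def build_hydrogens(self):
        self.calls.append(('hydrogens',))
        return []


demo_fixer = RecordingFixer()
report = fix_residues(['MET1', 'ALA2'], demo_fixer)
print(len(report), len(demo_fixer.calls))  # 2 6
```

With crimm available, one would presumably pass `ResidueFixer()` and a list of residues from a chain instead of the stand-ins. Heavy atoms are built first here, matching the order of the cell above.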

## Building Missing Atoms on the Entire Chain

`fix_chain` is a temporary solution for repairing all existing residues in a chain. By default, it builds missing atoms and missing hydrogens. A more robust and flexible `ChainFixer` class will be implemented shortly.

In [20]:
print(f'{prot_chain} has {len(list(prot_chain.get_atoms()))} atoms BEFORE fix')

<Polypeptide(L) id=C Residues=94> has 767 atoms BEFORE fix


In [21]:
built_atoms = fix_chain(prot_chain)

In [22]:
prot_chain.residues[-1]

NGLWidget()

<Residue ALA het=  resseq=94 icode= >


In [23]:
print(f'{prot_chain} has {len(list(prot_chain.get_atoms()))} atoms AFTER fix')

<Polypeptide(L) id=C Residues=94> has 1542 atoms AFTER fix


In [24]:
for chain in rna_chains:
    fix_chain(chain)
# since we did not specify any patch on the RNA chains, a warning will be given



In [25]:
rna_chains[0].residues[0]

NGLWidget()

<Residue C het=  resseq=1 icode= >


## More on the `TopologyDefinition` Class

We create a `TopologyDefinition` when we call `TopologyLoader`, and a `ParameterDict` when we call `ParameterLoader`.

In [26]:
topo_p

<TopologyLoader Ver=36.2 Contains 24 RESIDUE and 24 PATCH definitions>

In [27]:
param_p

<ParameterDict Bond: 132, Angle: 370, Urey Bradley: 113, Dihedral: 558, Improper: 35, CMAP: 6, Nonbond: 54, Nonbond14: 13, NBfix: 1>

Each individual residue definition can be accessed by its three-letter code.

In [28]:
topo_p['ALA']

<Residue Definition name=ALA code=A atoms=10>

In [29]:
topo_r['CYT'] # for nucleic acids, the one-letter code also works, e.g. topo_r['C']

<Residue Definition name=CYT  atoms=31>

Since the topology definition includes an internal coordinate (ic) table, a reference residue can be built directly from the `ResidueDefinition` object. In fact, the `SeqChainGenerator` uses this function to construct chains from sequences.

In [30]:
ref_res = topo_p['ALA'].create_residue()
ref_res

NGLWidget()

<Residue ALA het=  resseq=0 icode= >
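
Building from internal coordinates generally means placing each new atom from a bond length, a bond angle, and a dihedral relative to three atoms that are already positioned. The sketch below shows that single placement step in plain Python. It is a generic illustration and not crimm's implementation; the function names and the input values are hypothetical.

```python
import math


def sub(u, v):
    return tuple(x - y for x, y in zip(u, v))


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])


def unit(u):
    norm = math.sqrt(dot(u, u))
    return tuple(x / norm for x in u)


def place_atom(a, b, c, bond, angle_deg, dihedral_deg):
    """Place d so that |c-d| = bond, angle b-c-d = angle, dihedral a-b-c-d = dihedral."""
    theta = math.radians(angle_deg)
    phi = math.radians(dihedral_deg)
    bc = unit(sub(c, b))
    n = unit(cross(sub(b, a), bc))   # normal of the a-b-c plane
    m = cross(n, bc)                 # completes the local frame
    local = (-bond * math.cos(theta),
             bond * math.sin(theta) * math.cos(phi),
             bond * math.sin(theta) * math.sin(phi))
    return tuple(c[i] + local[0] * bc[i] + local[1] * m[i] + local[2] * n[i]
                 for i in range(3))


def measure_dihedral(a, b, c, d):
    """Signed dihedral a-b-c-d in degrees, used here to check the placement."""
    b1, b2, b3 = sub(b, a), sub(c, b), sub(d, c)
    n1, n2 = cross(b1, b2), cross(b2, b3)
    return math.degrees(math.atan2(dot(cross(n1, n2), unit(b2)), dot(n1, n2)))


# Illustrative coordinates and internal coordinate values
a, b, c = (0.0, 1.0, 0.0), (0.0, 0.0, 0.0), (1.5, 0.0, 0.0)
d = place_atom(a, b, c, bond=1.5, angle_deg=110.0, dihedral_deg=-60.0)
print(round(measure_dihedral(a, b, c, d), 3))
```

Repeating this step along the ic table, one atom at a time, yields a full set of coordinates from the definition alone.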


### Other Topology Definitions

CHARMM36 has a breadth of topology and parameter types, and we try to implement and use these definitions as much as possible. We have varying levels of support for many definitions, such as lipids, ethers, and carbohydrates, and we aim to support the ***cgenff*** and ***water*** topology and parameter sets for small molecule parameterization in the near future.

In [31]:
topo_lipids = TopologyLoader('lipid')
param_lipids = ParameterLoader('lipid')
param_lipids.fill_ic(topo_lipids)

In [32]:
topo_lipids.residues[:5]

[<Residue Definition name=LPPC  atoms=70>,
 <Residue Definition name=DLPC  atoms=106>,
 <Residue Definition name=DLPE  atoms=97>,
 <Residue Definition name=DLPS  atoms=99>,
 <Residue Definition name=DLPA  atoms=88>]

In [33]:
topo_lipids['LPPC'].create_residue()

NGLWidget()

<Residue LPPC het=  resseq=0 icode= >


## Parameters

Individual parameter values can be accessed from the `ParameterDict`.

In [34]:
param_p['nonbonded']

{'C': nonbond_param(epsilon=-0.11, rmin_half=2.0),
 'CA': nonbond_param(epsilon=-0.07, rmin_half=1.9924),
 'CC': nonbond_param(epsilon=-0.07, rmin_half=2.0),
 'CD': nonbond_param(epsilon=-0.07, rmin_half=2.0),
 'CE1': nonbond_param(epsilon=-0.068, rmin_half=2.09),
 'CE2': nonbond_param(epsilon=-0.064, rmin_half=2.08),
 'CP1': nonbond_param(epsilon=-0.02, rmin_half=2.275),
 'CP2': nonbond_param(epsilon=-0.055, rmin_half=2.175),
 'CP3': nonbond_param(epsilon=-0.055, rmin_half=2.175),
 'CPH1': nonbond_param(epsilon=-0.05, rmin_half=1.8),
 'CPH2': nonbond_param(epsilon=-0.05, rmin_half=1.8),
 'CS': nonbond_param(epsilon=-0.11, rmin_half=2.2),
 'CPT': nonbond_param(epsilon=-0.099, rmin_half=1.86),
 'CY': nonbond_param(epsilon=-0.073, rmin_half=1.99),
 'CAI': nonbond_param(epsilon=-0.073, rmin_half=1.99),
 'CT': nonbond_param(epsilon=-0.02, rmin_half=2.275),
 'CT1': nonbond_param(epsilon=-0.032, rmin_half=2.0),
 'CT2': nonbond_param(epsilon=-0.056, rmin_half=2.01),
 'CT2A': nonbond_param(eps
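
As a reminder of how such values are used, the CHARMM force field combines per-atom-type parameters into a pair interaction: the well depth is the geometric mean of the two epsilon magnitudes (the files store epsilon as a negative number), and the minimum-energy distance is the sum of the two `rmin_half` values. The sketch below evaluates that Lennard-Jones term in kcal/mol. The `nonbond_param` namedtuple is a local stand-in for the objects printed above, and `lj_energy` is a hypothetical helper, not a crimm function.

```python
import math
from collections import namedtuple

# Local stand-in mirroring the printed repr; not imported from crimm
nonbond_param = namedtuple('nonbond_param', ['epsilon', 'rmin_half'])


def lj_energy(p_i, p_j, r):
    """CHARMM-style Lennard-Jones energy for two atom types at distance r (Angstrom)."""
    eps = math.sqrt(abs(p_i.epsilon) * abs(p_j.epsilon))
    rmin = p_i.rmin_half + p_j.rmin_half
    ratio6 = (rmin / r) ** 6
    return eps * (ratio6 ** 2 - 2.0 * ratio6)


# Values for the 'C' and 'CT1' atom types taken from the output above
c = nonbond_param(epsilon=-0.11, rmin_half=2.0)
ct1 = nonbond_param(epsilon=-0.032, rmin_half=2.0)

# At r = rmin (4.0 Angstrom here) the energy equals minus the combined well depth
print(lj_energy(c, ct1, 4.0))
```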

Parameters can be obtained by providing the atom type names.

In [35]:
param_p.get_bond(('NH2', 'CT1'))

bond_param(kb=240.0, b0=1.455)

The reversed ordering is also accepted.

In [36]:
param_p.get_bond(('CT1', 'NH2'))

bond_param(kb=240.0, b0=1.455)
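
One simple way to make a bond lookup independent of atom type order is to normalize the key before storing and before retrieving. The sketch below shows this with a sorted tuple as the key. It is an assumption about how such a table could work, not a description of `ParameterDict` internals; `SymmetricBondTable` and the `bond_param` namedtuple are local stand-ins.

```python
from collections import namedtuple

# Local stand-in mirroring the printed repr; not imported from crimm
bond_param = namedtuple('bond_param', ['kb', 'b0'])


class SymmetricBondTable:
    """Bond parameter table whose keys ignore the order of the two atom types."""

    def __init__(self):
        self._table = {}

    @staticmethod
    def _key(atom_types):
        return tuple(sorted(atom_types))

    def add(self, atom_types, param):
        self._table[self._key(atom_types)] = param

    def get(self, atom_types):
        return self._table[self._key(atom_types)]


table = SymmetricBondTable()
table.add(('NH2', 'CT1'), bond_param(kb=240.0, b0=1.455))
print(table.get(('CT1', 'NH2')))  # bond_param(kb=240.0, b0=1.455)
```

For angles and dihedrals, sorting would scramble the central atoms, so a table like this would instead compare the tuple with its reverse and keep one of the two as the canonical key.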

Alternatively, we can get the values by providing the actual topology element.

In [37]:
print(angle)
param_p.get_from_topo_element(angle)

<Angle( HB1,   CB,  HB3)>


angle_param(ktheta=35.5, theta0=108.4)
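
To show what these two numbers mean, the CHARMM angle term is a harmonic function, V = ktheta * (theta - theta0)^2, with `theta0` given in degrees and `ktheta` in kcal/mol/rad^2. The sketch below evaluates it for the parameters returned above. The helper `angle_energy` is hypothetical, and any Urey-Bradley contribution is ignored.

```python
import math


def angle_energy(ktheta, theta0_deg, theta_deg):
    """Harmonic CHARMM angle energy in kcal/mol; the deviation is converted to radians."""
    delta = math.radians(theta_deg - theta0_deg)
    return ktheta * delta ** 2


# Parameters from the output above; the measured angles are illustrative
at_equilibrium = angle_energy(35.5, 108.4, 108.4)
five_degrees_off = angle_energy(35.5, 108.4, 113.4)
print(at_equilibrium)   # 0.0
print(round(five_degrees_off, 4))
```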