# Introduction

To conduct a classical MD simulation, we need to specify the properties of the various atoms in our structure, the interaction potentials between atom tuples, and the external potentials. These are described by the forcefield files. OpenMM uses an XML file format for forcefields, and details about the format can be found in its online documentation. Once we have forcefields for all components of our system topology, OpenMM will first map each residue in our topology to the residue templates in the forcefields to assign atom properties. Next, the various bonds, defined by the connected atoms, bond type, and bonding parameters, will be constructed based on the atom classes and the forcefield specification.
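To make the template-matching idea concrete, here is a minimal sketch that parses a toy forcefield XML with the standard library and resolves a residue's atoms to classes and then to bond parameters. The residue name `DEM`, the type names, and all numeric values are made up for illustration; OpenMM's real matching logic is more involved than this lookup.

```python
import xml.etree.ElementTree as ET

# Toy forcefield: names and numbers are illustrative, not from a shipped file.
FF_XML = """
<ForceField>
  <AtomTypes>
    <Type name="demo-N" class="N" element="N" mass="14.007"/>
    <Type name="demo-CT" class="CT" element="C" mass="12.011"/>
  </AtomTypes>
  <Residues>
    <Residue name="DEM">
      <Atom name="N" type="demo-N"/>
      <Atom name="CA" type="demo-CT"/>
      <Bond atomName1="N" atomName2="CA"/>
      <ExternalBond atomName="N"/>
    </Residue>
  </Residues>
  <HarmonicBondForce>
    <Bond class1="N" class2="CT" length="0.145" k="280000.0"/>
  </HarmonicBondForce>
</ForceField>
"""

root = ET.fromstring(FF_XML)

# Atom type name -> atom class
type_to_class = {t.get("name"): t.get("class") for t in root.iter("Type")}

# Atom name in the residue template -> atom class
residue = root.find("Residues/Residue")
atom_class = {
    a.get("name"): type_to_class[a.get("type")] for a in residue.findall("Atom")
}

# Unordered class pair -> (equilibrium length, force constant)
bond_params = {
    frozenset((b.get("class1"), b.get("class2"))): (
        float(b.get("length")),
        float(b.get("k")),
    )
    for b in root.find("HarmonicBondForce")
}

# Each template bond picks up its parameters through the classes of its atoms.
for bond in residue.findall("Bond"):
    a1, a2 = bond.get("atomName1"), bond.get("atomName2")
    print(a1, a2, bond_params[frozenset((atom_class[a1], atom_class[a2]))])
```

The point of the sketch is the two-level indirection: residue templates name atom types, types carry classes, and bonded parameters are keyed by classes.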

CRO, short for chromophore, refers to the 4-(p-hydroxybenzylidene)imidazolin-5-one formed from three intrinsic residues (Ser65, Tyr66 and Gly67) in the polypeptide chain by a post-translational maturation process. It is a non-canonical amino acid (NCAA), and there is no residue template for it in the Amber `ff14sb` forcefield we plan to use. Given the unique structure of the CRO residue, there are also some new bond types that are not parametrized in `ff14sb`. The goal of this notebook is to adapt an existing [Amber tutorial](http://ambermd.org/tutorials/basic/tutorial5/) to explain how to parametrize our anionic chromophore (the proton is stripped in Step 2, leaving a net charge of -1) and export the forcefield to an XML file accepted by OpenMM.

Before starting, please make sure you have the `CRO.cif` and the `cro.mc` files in the folder. Also, since we will rely on the AmberTools software suite, please check your notebook kernel is set to use the correct conda environment.

Note, parametrizing an NCAA is slightly more challenging than parametrizing a standalone ligand molecule, because the NCAA is connected to the amino acids before and after it in the peptide chain, and we need to take care of issues with clipping the termini atoms when preparing the forcefield.

The following are the steps that we will follow:

1. Computing partial charges and atom types of the custom residue
2. Preparing the force field parameters
3. Exporting the frcmod files to an OpenMM XML file

# Step 1. Computing partial charges and atom types for CRO

The starting CRO template `CRO.cif` comes from `components.cif`, provided [here](https://www.wwpdb.org/data/ccd) by wwPDB. It contains the idealized geometry of the molecule, which we will use to compute partial charges and infer atom types. For this step, we will use `antechamber`, which was originally written to be used along with the general AMBER force field (GAFF). GAFF contains many more atom types and therefore covers organic chemical space better, and it is fully compatible with the AMBER forcefields because it uses lowercase letters to denote atom types, avoiding conflicts with AMBER's uppercase convention. `antechamber` is a versatile program that can perform many file conversions; it can also assign atomic charges and atom types, which is the main function we will rely on today. Please refer to the "Antechamber and GAFF" chapter of the reference manual for details.

The current step takes `CRO.cif` as input and produces `cro.ac`, in which each atom is assigned an Amber atom type and a partial charge calculated using the AM1-BCC scheme (`-c bcc`).

Note, we do not strip off the tyrosyl hydrogen or the terminal hydrogen and hydroxyl groups before this step, as I observed that doing so does not quite give the correct result. Also, the quantum chemistry program called by `antechamber` will produce a bunch of auxiliary files. We don't really need to look at these.

In [1]:
import subprocess

# -fi input file format
# -i  input file name
# -bk component/block ID to read from the ccif file
# -fo output file format
# -o  output file name
# -c  charging scheme
# -at atom type

subprocess.run('antechamber -fi ccif -i CRO.cif -bk CRO -fo ac -o cro.ac -c bcc -at amber'.split())


Welcome to antechamber 21.0: molecular input file processor.

acdoctor mode is on: check and diagnose problems in the input file.
The atom type is set to amber; the options available to the -at flag are
    gaff, gaff2, amber, bcc, and sybyl.
-- Check Unusual Elements --
   Status: pass
-- Check Open Valences --
   Status: pass
-- Check Geometry --
      for those bonded   
      for those not bonded   
   Status: pass
-- Check Weird Bonds --
   Status: pass
-- Check Number of Units --
   Status: pass
acdoctor mode has completed checking the input file.

Info: Total number of electrons: 168; net charge: 0

Running: /Users/ziyuanzhao/opt/anaconda3/envs/AmberTools21/bin/sqm -O -i sqm.in -o sqm.out



CompletedProcess(args=['antechamber', '-fi', 'ccif', '-i', 'CRO.cif', '-bk', 'CRO', '-fo', 'ac', '-o', 'cro.ac', '-c', 'bcc', '-at', 'amber'], returncode=0)
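The cells in this notebook rely on reading `returncode=0` off the printed `CompletedProcess`. As a hedged alternative, a small wrapper can raise as soon as a tool fails and hand back its output for later inspection. The helper name `run_tool` is hypothetical, and the demo uses the Python interpreter as a stand-in so the sketch runs without AmberTools.

```python
import shlex
import subprocess
import sys


def run_tool(command):
    """Run a command line, raising CalledProcessError on a nonzero exit code."""
    result = subprocess.run(
        shlex.split(command),   # split like a shell would, honoring quotes
        check=True,             # fail loudly instead of returning silently
        capture_output=True,    # keep stdout/stderr for later parsing
        text=True,              # decode bytes to str
    )
    return result.stdout


# Stand-in command; with AmberTools installed one could pass the antechamber
# command line used above instead.
demo_output = run_tool(f'"{sys.executable}" -c "print(42)"')
print(demo_output.strip())
```

With `check=True`, a failed run stops the notebook at the offending cell rather than letting later cells operate on stale or missing files.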

Next we need to fix an atom type in the `cro.ac` file: we change `NT` to `N` to indicate that this nitrogen is not actually a terminal atom. This allows Amber's forcefield to connect the residue to the C-terminus of the previous residue and add the correct bonds.

In [2]:
with open('cro.ac') as f:
    # ' N' has the same width as 'NT', so the column layout is unchanged
    newText=f.read().replace('NT', ' N')

with open('cro.ac', "w") as f:
    f.write(newText)
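The blanket string replacement above works here, but it would also rewrite any other occurrence of `NT` in the file. A more targeted sketch is shown below. It assumes that the atom type is the last whitespace-separated field on each `ATOM` line of the `.ac` file; the sample lines are illustrative, not copied from the real `cro.ac`, and the function name `retype_atoms` is hypothetical.

```python
def retype_atoms(ac_text, old="NT", new="N"):
    """Rewrite the atom type only on ATOM lines whose last field equals `old`."""
    fixed = []
    for line in ac_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "ATOM" and fields[-1] == old:
            body = line.rstrip()
            # Right-justify the new type so the line keeps its original width.
            line = body[: -len(old)] + new.rjust(len(old))
        fixed.append(line)
    return "\n".join(fixed) + "\n"


# Illustrative lines (assumed layout: atom type in the last column).
sample = (
    "ATOM      1  N1  CRO     1       3.326   1.548  -0.000 -0.896800        NT\n"
    "ATOM      2  CA1 CRO     1       3.970   2.846  -0.000  0.040900        CT\n"
)
retyped = retype_atoms(sample)
print(retyped)
```

Only the type column of matching `ATOM` lines is touched, so atom names or comments that happen to contain `NT` are left alone.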

# Step 2. Preparing the force field parameters

Recall from our last notebook that we will set the chromophore in our eGFP starting structure to be in the anionic state, so we need to strip off a proton (`HOH`, the hydrogen on the phenolic oxygen `OH`). We also need to strip off `HXT`, `OXT` and `H2`, as these are the terminal atoms that are lost as water when the amino acid is joined up in a chain. This is handled by the AmberTools `prepgen` program, which takes a mainchain (`.mc`) file and outputs a prepped input (`.prepi`) file. (This really is an archaic file format; somebody should consider writing a newer version of this tool.) The mainchain file is provided in the folder as `cro.mc`. Note that it is slightly different from the one used in Amber's tutorial.



In [3]:
# -m  mainchain file name
# -rn residue name
subprocess.run('prepgen -i cro.ac -o cro.prepin -m cro.mc -rn CRO'.split())


PRE_HEAD_TYPE is     C
POST_TAIL_TYPE is     N
Net charge of truncated molecule is    -1.00
HEAD_ATOM      1   N1
TAIL_ATOM     13   C3
MAIN_CHAIN     1    1   N1
MAIN_CHAIN     2    2  CA1
MAIN_CHAIN     3    6   C1
MAIN_CHAIN     4    8   N3
MAIN_CHAIN     5   12  CA3
MAIN_CHAIN     6   13   C3
OMIT_ATOM      1   25   H2
OMIT_ATOM      2   34  HXT
OMIT_ATOM      3   23  OXT
OMIT_ATOM      4   40  HOH
Number of mainchain atoms (including head and tail atom):     6
Number of omited atoms:     4
Info: There is a bond linking a non-head and non-tail residue atom (OH) and an omitted atom (HOH).
      You need to specifically add this bond in LEaP using the command 'bond <atom1> <atom2> [order]'
      to link OH to an atom in another residue (similar to disulfide bonds)!


CompletedProcess(args=['prepgen', '-i', 'cro.ac', '-o', 'cro.prepin', '-m', 'cro.mc', '-rn', 'CRO'], returncode=0)
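Since the whole point of omitting `HOH` is to end up with the deprotonated chromophore, it can be worth checking the reported net charge programmatically. The sketch below parses the "Net charge of truncated molecule" line shown in the output above. It assumes the `prepgen` output has been captured as text (for example with `capture_output=True, text=True` in `subprocess.run`), which the cell above does not do; the function name is hypothetical.

```python
import re


def truncated_net_charge(prepgen_stdout):
    """Extract the net charge reported by prepgen from its captured output."""
    match = re.search(
        r"Net charge of truncated molecule is\s+(-?\d+(?:\.\d+)?)",
        prepgen_stdout,
    )
    if match is None:
        raise ValueError("no net-charge line found in prepgen output")
    return float(match.group(1))


# Lines copied from the output above.
log = (
    "PRE_HEAD_TYPE is     C\n"
    "POST_TAIL_TYPE is     N\n"
    "Net charge of truncated molecule is    -1.00\n"
)
charge = truncated_net_charge(log)
print(charge)
```

A value that is far from an integer, or that does not match the intended protonation state, is a hint that the wrong atoms were listed in the mainchain file.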

After the above step we will get a `cro.prepin` file that conveys essentially the same information as the `cro.ac` file, except that the atoms we want to omit and the charge redistribution have been handled properly. The next step is to use `parmchk2` to compare the bonds described by the molecular topology in this file against Amber's databases `parm10.dat` and the more comprehensive `gaff.dat`.

Now here's a huge difference in the design philosophy behind Amber forcefields and OpenMM forcefields. The former are designed to be incremental, i.e., we can load a very basic forcefield like `parm10.dat` and then load further modifications to it that provide additional atom and bond definitions. Think of patches. In fact, `ff14sb` in Amber is stored as a `.frcmod` (forcefield modification) file. OpenMM forcefields, by contrast, can feel more monolithic. There is flexibility in referring to atom types from previously imported forcefields and in overriding residue definitions, but standard bonded terms (harmonic, dihedral, improper) can only be defined once; beyond that it's just undefined behavior. (That's quite a piece of fine print that they didn't explain in the main documentation!) For more discussion of OpenMM's ideology with forcefields, read [here](https://github.com/openmm/openmm/issues/2481#issuecomment-557921856) and [here](http://docs.openmm.org/latest/userguide/application/05_creating_ffs.html) (section 6.3).
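Because repeated bonded definitions are a silent hazard, a quick pre-flight check can be helpful before loading several XML files together. The following is a rough sketch, not an OpenMM feature: it only looks at `HarmonicBondForce` entries that are keyed by `class1`/`class2`, and the two XML snippets are toy inputs with made-up values.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def bond_keys(xml_text):
    """Return the order-independent class pairs defined under HarmonicBondForce."""
    force = ET.fromstring(xml_text).find("HarmonicBondForce")
    if force is None:
        return []
    keys = []
    for bond in force.findall("Bond"):
        c1, c2 = bond.get("class1"), bond.get("class2")
        if c1 is not None and c2 is not None:  # skip type-based entries
            keys.append(tuple(sorted((c1, c2))))
    return keys


def duplicated_bonds(*xml_texts):
    """List class pairs that are defined more than once across all inputs."""
    counts = Counter(key for text in xml_texts for key in bond_keys(text))
    return sorted(key for key, n in counts.items() if n > 1)


base = """<ForceField><HarmonicBondForce>
  <Bond class1="C" class2="CT" length="0.1522" k="265000.0"/>
  <Bond class1="CT" class2="N" length="0.1449" k="282000.0"/>
</HarmonicBondForce></ForceField>"""

patch = """<ForceField><HarmonicBondForce>
  <Bond class1="CT" class2="C" length="0.1500" k="250000.0"/>
  <Bond class1="CD" class2="N" length="0.1380" k="350000.0"/>
</HarmonicBondForce></ForceField>"""

print(duplicated_bonds(base, patch))
```

The same idea extends to angle and torsion sections by widening the key to three or four classes.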

With this understanding, I can better explain the current step. Our ultimate plan is to use as many parameters as possible from the `ff14sb` forcefield, but clearly, some parameters for the chromophore are not described by `ff14sb`. For those, we want to use as many similar parameters as possible from the `parm10` forcefield that underlies `ff14sb`. For the few parameters that have no similar matches in `parm10`, we will resort to the most general `gaff` forcefield. This ensures that our parameters are as consistent as they can be within our model system for the eGFP crystal. `parmchk2` is exactly the tool for this. With the `-a Y` flag, it will generate a `.frcmod` file containing all similar parameters found in the requested database, as well as those parameters for which it can't find good matches (marked by "ATTN"). After this step we will get two files, `cro1.frcmod` based on `parm10` and `cro2.frcmod` based on `gaff`.

Note, please make sure you have set `$AMBERHOME` to point at your AmberTools installation folder.

In [10]:
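# -a Y  print all parameters, including those found in the parameter file
# -p    parameter file to search for matches
subprocess.run('parmchk2 -i cro.prepin -f prepi -o cro.frcmod -a Y -p $AMBERHOME/dat/leap/parm/parm10.dat', shell=True)
# drop the lines without a good parm10 match; gaff will cover those
subprocess.run('grep -v "ATTN" cro.frcmod > cro1.frcmod', shell=True)
subprocess.run('parmchk2 -i cro.prepin -f prepi -o cro2.frcmod', shell=True) # no -p defaults to gaff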

CompletedProcess(args='parmchk2 -i cro.prepin -f prepi -o cro2.frcmod', returncode=0)
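For readers who prefer to stay in Python rather than shell out to `grep`, the filtering step can be sketched as below. The sample frcmod text is illustrative (atom types, numbers, and comment wording are made up), and the function name is hypothetical; the only behavior mirrored from the cell above is "drop every line containing ATTN".

```python
def drop_attn_lines(frcmod_text, marker="ATTN"):
    """Return the text without any line that contains `marker` (like grep -v)."""
    kept = [
        line
        for line in frcmod_text.splitlines(keepends=True)
        if marker not in line
    ]
    return "".join(kept)


# Illustrative frcmod fragment, not real parmchk2 output.
sample_frcmod = (
    "BOND\n"
    "CA-CD  400.00   1.400       same as CA-CA\n"
    "CD-N     0.00   0.000       ATTN, need revision\n"
    "\n"
    "ANGLE\n"
)
filtered = drop_attn_lines(sample_frcmod)
print(filtered)
```

To apply it to the real files, one could read `cro.frcmod`, pass its contents through the function, and write the result to `cro1.frcmod`.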

# Step 3. Exporting the frcmod files to OpenMM XML file
Finally, we will export the frcmod files. For this, we will use `parmed`, a Python library for aiding investigations of biomolecular systems with popular molecular simulation packages like Amber, CHARMM, and OpenMM. It includes a set of tools for juggling the different forcefield file formats required by these MD packages. We will skip the tutorial for now and just run the code that does the job of combining these frcmod files and spitting out the XML file for us. Beware, the order in which we load the files is important. We want to load the `gaff` parameters first and then overwrite as many of them as possible with `parm10` parameters.
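As a mental model for why the load order matters, think of each frcmod file as a dictionary of parameters where later layers win on conflicts. The sketch below uses plain dictionaries with made-up keys and values; it illustrates the "later overrides earlier" idea only and is not a description of how `parmed` merges files internally.

```python
def layer_parameters(*layers):
    """Merge parameter dictionaries; entries in later layers override earlier ones."""
    merged = {}
    for layer in layers:
        merged.update(layer)
    return merged


# Made-up bond parameters keyed by atom-type pair: (length, force constant).
gaff_like = {
    ("CA", "CD"): (0.1400, 300000.0),
    ("CD", "N"): (0.1380, 350000.0),
}
parm10_like = {
    ("CA", "CD"): (0.1410, 390000.0),  # overrides the gaff-like entry
}

merged = layer_parameters(gaff_like, parm10_like)
for key, value in sorted(merged.items()):
    print(key, value)
```

Swapping the argument order would let the general-purpose values overwrite the more specific ones, which is the opposite of what we want here.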

In [59]:
import parmed as pmd
import xml.etree.ElementTree as ET

ff_input = ['cro2.frcmod', 'cro1.frcmod']
top_input = 'cro.mol2'

# prepare mol2 from the prepin file, note this is the structure after stripping atoms
# rm -f: do not complain if there is no leftover leaprc from a previous run
subprocess.run('rm -f leaprc; echo "loadAmberPrep cro.prepin\nsaveMol2 CRO cro.mol2 1\nquit" > leaprc', shell=True)
subprocess.run('tleap')

# amber -> openmm pipeline
ff = pmd.openmm.OpenMMParameterSet.from_parameterset(
    pmd.amber.AmberParameterSet(ff_input)
)

# adds residue template
mol2 = pmd.load_file(top_input)
ff.residues[mol2.name] = mol2

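# export the raw parmed output to an intermediate file, then modify it
ff.write('cro_.xml')

# note, ET produces slightly awkward formatting, not a big deal tbh
# parse the intermediate file we just wrote; the final result goes to cro.xml
cro = ET.parse('cro_.xml')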
root = cro.getroot()
atomtypes = root.findall('AtomTypes')[0]
for atomtype in atomtypes.findall('Type'):
    if atomtype.attrib.get('class') != 'CD':
        atomtypes.remove(atomtype)
ET.SubElement(root, 'Include').set('file','amber/ff14SB.xml')
residue = root.findall('Residues')[0].findall('Residue')[0]
ET.SubElement(residue, 'ExternalBond').set('atomName','N1')
ET.SubElement(residue, 'ExternalBond').set('atomName','C3')
cro.write('cro.xml')

-I: Adding /Users/ziyuanzhao/opt/anaconda3/envs/AmberTools21/dat/leap/prep to search path.
-I: Adding /Users/ziyuanzhao/opt/anaconda3/envs/AmberTools21/dat/leap/lib to search path.
-I: Adding /Users/ziyuanzhao/opt/anaconda3/envs/AmberTools21/dat/leap/parm to search path.
-I: Adding /Users/ziyuanzhao/opt/anaconda3/envs/AmberTools21/dat/leap/cmd to search path.

Welcome to LEaP!
Sourcing leaprc: ./leaprc
Loading Prep file: ./cro.prepin
Writing mol2 file: cro.mol2
	Quit



Some explanation of the extra steps after writing out the XML file from `parmed` might be helpful. First, since OpenMM does not allow duplicate atom type definitions, we must cull all atom types that already appear in the `ff14sb` forcefield we are going to use later, so only the new `CD` atom type is kept in our XML file. We also explicitly declare that `amber/ff14SB.xml` should be imported when this XML is loaded, to supply these pre-existing atom definitions. Another tricky point is that we must add two `ExternalBond` tags to the residue template (on `N1` and `C3`) so that OpenMM can add the relevant bonds connecting CRO to the amino acids before and after it, even though we don't specify those bonds explicitly in our XML file.
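The post-processing described above can be exercised on a small in-memory XML to see its effect without running the full pipeline. This is a sketch under assumptions: the toy XML below only mimics the general layout of the exported file, the type names and masses are made up, and the function name `finalize_residue_xml` is hypothetical.

```python
import xml.etree.ElementTree as ET


def finalize_residue_xml(xml_text, keep_classes, include_file, external_atoms):
    """Prune atom types, add an Include tag, and mark external bond atoms."""
    root = ET.fromstring(xml_text)

    # Keep only atom types whose class is new; the rest come from the include.
    atom_types = root.find("AtomTypes")
    for atom_type in list(atom_types.findall("Type")):
        if atom_type.get("class") not in keep_classes:
            atom_types.remove(atom_type)

    # Pull in the base forcefield for all pre-existing definitions.
    ET.SubElement(root, "Include").set("file", include_file)

    # Mark the atoms that bond to the neighboring residues.
    residue = root.find("Residues/Residue")
    for name in external_atoms:
        ET.SubElement(residue, "ExternalBond").set("atomName", name)
    return root


toy_xml = """<ForceField>
  <AtomTypes>
    <Type name="N" class="N" element="N" mass="14.01"/>
    <Type name="CD" class="CD" element="C" mass="12.01"/>
  </AtomTypes>
  <Residues>
    <Residue name="CRO">
      <Atom name="N1" type="N"/>
      <Atom name="C3" type="CD"/>
    </Residue>
  </Residues>
</ForceField>"""

root = finalize_residue_xml(toy_xml, {"CD"}, "amber/ff14SB.xml", ["N1", "C3"])
print(ET.tostring(root, encoding="unicode"))
```

Printing the result makes it easy to confirm that only the new atom type survives and that both `ExternalBond` tags landed inside the residue template.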