# Preparing a protein-ligand system for molecular dynamics simulation.

This is part 1 of a three-part tutorial on molecular dynamics simulations of biomolecular systems prepared for *CompBioAsia* 2025.

This part is concerned with preparing a molecular system for MD simulation.
A second part covers actually running MD simulations, using the **AMBER** package.
A third part looks at another MD package - **OpenMM**.

## Prerequisites
Assuming you have started this Notebook using the `run_notebook.sh` script in this folder, your Python environment should be complete.

## Background
One of the most widespread uses of molecular dynamics simulations is to predict protein-ligand binding affinities, a key process in drug design and discovery. This requires:

1. A three-dimensional model for the ligand - most commonly this can be **predicted from first principles**.
2. A three-dimensional model for the protein - this may sometimes be predicted computationally (e.g. with **AlphaFold** or similar), but most commonly makes use of experimental methods such as **X-ray crystallography** or **NMR spectroscopy**.
3. A prediction of where on the protein the ligand binds - often obtained via the molecular modelling method of **Docking**.
4. The production of a molecular model for the protein-ligand complex, and conversion into a form that is ready for MD simulation.

This tutorial is concerned with steps 1, 2, and 4 of this process - the Docking step will be discussed at another time.



* In the first part you will learn how to produce a molecular model for an analogue of the anticancer drug [Imatinib](https://en.wikipedia.org/wiki/Imatinib).

* In the second part you will learn how to produce a molecular model of the protein target, the [Abl tyrosine kinase](https://en.wikipedia.org/wiki/ABL_(gene)).

* In part three (skipping over the docking bit), you will see how to combine the model of the protein with the molecular model for the ligand generated in part one, and complete the preparation of the system for molecular dynamics simulation.

There are very many approaches to system preparation for MD; what you see here is just one of them. It leverages a number of different system preparation tools from different sources, so to make the process simpler these have been "wrapped" into a small number of Python functions in the package *cba_tools*.

If you want to see the details, take a look at `cba_tools.py`!

**Authors**:
This tutorial is adapted from CCPBioSim's [BioSim analysis workshop](https://github.com/CCPBioSim/BioSim-analysis-workshop).

*Updates*: Charlie Laughton (charles.laughton@nottingham.ac.uk)

## Part 1. Constructing a model for the ligand

We begin by creating a model for our chosen ligand. If you are starting - as here - from nothing, then a good way to do this can be to work out a description of its structure in [SMILES format](https://daylight.com/dayhtml/doc/theory/theory.smiles.html), and then apply tools that can generate a 3D model of the molecule from this.

In doing this, one of the things you need to consider carefully is the likely **protonation state** of your ligand. For example, if it contains basic amino groups, most likely at physiological pH these will be protonated. If on the other hand it contains carboxylic acid groups, most likely these will be deprotonated.

If you are an experienced chemist you can write your SMILES string in a way that exactly specifies this, but if you are less confident, there are tools that can automate the process. That is what we do here, using the tool `smiles_to_pdb` from the CBA tools package.
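
As a rough illustration of why these protonation states are expected, the Henderson-Hasselbalch relationship gives the fraction of a titratable group that still carries its proton at a given pH. The sketch below is not part of `cba_tools`; the function name and the pKa values are illustrative assumptions (typical textbook magnitudes, not measurements for this ligand).

```python
def fraction_protonated(pka, ph):
    """Fraction of a titratable group that carries its proton at the given pH."""
    return 1.0 / (1.0 + 10.0 ** (ph - pka))


# Assumed, illustrative pKa values - not measured for the imatinib analogue.
example_groups = {
    "aliphatic amine (conjugate acid)": 9.5,
    "carboxylic acid": 4.5,
}

for name, pka in example_groups.items():
    frac = fraction_protonated(pka, ph=7.4)
    print(f"{name}: pKa {pka:.1f} -> {100 * frac:.1f}% protonated at pH 7.4")
```

Note that for an amine the protonated form is the charged one, whereas for a carboxylic acid the protonated form is neutral, so with these assumed pKa values both kinds of group would be mostly charged at pH 7.4.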

### 1.1 Import required packages

In [None]:
from cba_tools import smiles_to_pdb
# For visualization of the results:
import nglview as nv
import mdtraj as mdt

### 1.2 From 1D to 3D

The structures of Imatinib, and our Imatinib analogue, are shown below (can you spot the differences?):


![imatinib and analogue](imatinib_analogue.png)

We begin with a description of the molecular structure of our imatinib analogue in the form of a SMILES string (so effectively a '1D' representation of the structure):

"c1ccc(NC(=O)c2ccc(CN3CCN(C)CC3)cc2)cc1Nc1nccc(-c2ccncc2)n1"

Now we use the CBA tool `smiles_to_pdb` to convert it to a 3D representation, and save to disk as a PDB format file. We ask the tool to make sure the molecule is created in an ionization state appropriate for a physiological pH:

In [None]:
ligand_smiles = 'c1ccc(NC(=O)c2ccc(CN3CCN(C)CC3)cc2)cc1Nc1nccc(-c2ccncc2)n1'
charge = smiles_to_pdb(ligand_smiles, 'ligand_pH7.pdb', pH=7.4)
print('3D structure created, formal charge = ',charge)
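
When you write SMILES strings for your own ligands, a quick syntax check can catch simple typing mistakes before conversion. The sketch below is a hypothetical helper (not part of `cba_tools`) that only checks that branch parentheses are balanced and that every ring-closure label is opened and closed. It is no substitute for a real cheminformatics parser and says nothing about whether the chemistry makes sense.

```python
def smiles_syntax_problems(smiles):
    """Return a list of simple syntax problems found in a SMILES string."""
    problems = []
    depth = 0           # current nesting depth of branch parentheses
    open_rings = set()  # ring-closure labels opened but not yet closed
    i = 0
    while i < len(smiles):
        ch = smiles[i]
        if ch == "[":
            # Skip bracket atoms such as [NH+]: digits inside are not ring labels.
            end = smiles.find("]", i)
            if end == -1:
                problems.append("unclosed '['")
                break
            i = end
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                problems.append("')' without a matching '('")
                depth = 0
        elif ch == "%":
            # Two-digit ring-closure label, e.g. %12
            open_rings ^= {smiles[i + 1:i + 3]}
            i += 2
        elif ch.isdigit():
            # Toggle: first occurrence opens the ring, the next one closes it.
            open_rings ^= {ch}
        i += 1
    if depth > 0:
        problems.append(f"{depth} unclosed '('")
    if open_rings:
        problems.append(f"unclosed ring labels: {sorted(open_rings)}")
    return problems


ligand_smiles = 'c1ccc(NC(=O)c2ccc(CN3CCN(C)CC3)cc2)cc1Nc1nccc(-c2ccncc2)n1'
print(smiles_syntax_problems(ligand_smiles))  # an empty list means no problems were found
```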

Let's take a look at the 3D structure, using `nglview`: 

In [None]:
traj = mdt.load('ligand_pH7.pdb')  # same file name (and case) as written by smiles_to_pdb above
view = nv.show_mdtraj(traj)
view

Take a careful look at the structure. Hopefully you can convince yourself that you now have a chemically complete and structurally reasonable model for your imatinib analogue. For example, if you look carefully at the piperazine ring, you should see that both nitrogen atoms are protonated (this is why the molecule has a formal charge of +2).

With a structure for the ligand prepared, we can now move on in Part 2 to building a structure for the protein target.

## Part 2. Remediating a suitable protein structure obtained from the Protein Data Bank

The Protein Data Bank ([PDB](https://www.rcsb.org)) is a very valuable source of structures for MD simulation, but it must be understood that the crystal structure data itself is really just raw material - there are typically many steps that must be taken in order to generate simulation-ready systems from it. Some of these are:

1. The crystal structure may contain more data than is needed for the simulation (e.g. multiple copies of the protein) - it may need to be edited down.

2. Almost certainly the crystal structure will have missing data. It is possible that certain heavy atoms - maybe whole sections of the protein - were not resolved in the experiment and are missing. Molecular simulations require chemically-complete models for the components so this must be rectified.

3. Even if the structure is complete at the heavy-atom level, if it was solved by X-ray crystallography it is unlikely that any hydrogen atoms will have been resolved, so these missing atoms must be added as well.

### 2.1 Import required packages

In [None]:
from cba_tools import fix, add_h

### 2.2 From Xray crystallography data to a partial structural model

Although it's now often possible to obtain a (nearly) "ready to run" model for any protein via [AlphaFold](https://alphafold.com/) (or similar), it remains the case that if a good quality and relevant crystal structure is available from the Protein Data Bank this can produce a better starting model for a simulation.

It turns out that the crystal structure of Abl in complex with Imatinib itself has been solved, with PDB code [2HYY](https://www.rcsb.org/structure/2HYY), so if we can extract just the protein component from this, it would seem a good place to start.

A copy of this PDB file is included ('2hyy.pdb'). Step one is to take a look:

In [None]:
pdb2hyy = mdt.load('2hyy.pdb')
view = nv.show_mdtraj(pdb2hyy)
view.add_representation('ball+stick', 'water')
view

You should be able to work out that the crystal structure features four copies of the Abl protein, each with one molecule of Imatinib bound to it. Each has its own collection of water molecules too. For now, we are going to assume that it's only the first of these copies (chain 'A') that we want to use to start building our simulation system.

What you probably can't see straight away is that there are some significant issues with this crystal structure. There are quite a few atoms that ought to be there, but aren't because they could not be seen in the experimental electron density. This results in some amino acid side chains being incomplete (e.g. a histidine sidechain - left panel below), and even some entire residues being absent, creating 'gaps' in the protein chain (right). 

|Missing side chain atoms|Missing residue|
|---------------------|------------------|
|![histidine](his.png)|![a gap](gap.png) |
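
Gaps like the one in the right-hand panel usually show up as jumps in the residue numbering of a chain. As a minimal sketch (a hypothetical helper, not part of `cba_tools` or `pdbfixer`), the function below reads the fixed-column `ATOM` records of a PDB file and reports, for each chain, any place where consecutive alpha-carbon residue numbers differ by more than one. It assumes the file follows the standard PDB column layout and that residue numbers increase along each chain.

```python
from pathlib import Path


def find_chain_gaps(pdb_lines):
    """Return {chain_id: [(last_resid_before_gap, first_resid_after_gap), ...]}."""
    last_seen = {}  # chain ID -> most recently seen residue number
    gaps = {}
    for line in pdb_lines:
        if not line.startswith("ATOM"):
            continue
        if line[12:16].strip() != "CA":  # use one alpha carbon per residue
            continue
        chain = line[21]                 # column 22: chain identifier
        resid = int(line[22:26])         # columns 23-26: residue sequence number
        gaps.setdefault(chain, [])
        previous = last_seen.get(chain)
        if previous is not None and resid - previous > 1:
            gaps[chain].append((previous, resid))
        last_seen[chain] = resid
    return gaps


pdb_path = Path("2hyy.pdb")  # the crystal structure file used in this tutorial
if pdb_path.exists():
    with pdb_path.open() as handle:
        for chain, chain_gaps in find_chain_gaps(handle).items():
            print(f"chain {chain}: {len(chain_gaps)} gap(s) {chain_gaps}")
```

A check like this only finds missing residues inside a chain. It cannot detect missing side-chain atoms or residues missing from the chain termini, so visual inspection is still worthwhile.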


For simulation purposes, these missing atoms must be reintroduced somehow. The CBA tool *fix*, which uses [pdbfixer](https://github.com/openmm/pdbfixer), can be used to do this. It requires the name of the PDB file to fix, a name for the 'fixed' file it will generate, a list of the chains to keep, and a decision as to whether any missing residues at the N- and C-termini should be ignored (trim=True), or reconstructed as well (trim=False).

In the cell below you can see we have decided to use chain 'A', won't bother to reconstruct missing N- and C-terminal residues, and will call the remediated PDB file 'abl_imatinib_heavy.pdb' (because it should be complete at the heavy atom level, though still missing hydrogens):

In [None]:
fix('2hyy.pdb', 'abl_imatinib_heavy.pdb', keep_chains=['A'], trim=True)

The messages tell you that the most significant thing is that a missing residue (Glu at position 40) has been regenerated.

Let's take a look at the result:

In [None]:
abl_imatinib = mdt.load('abl_imatinib_heavy.pdb')
view = nv.show_mdtraj(abl_imatinib)
view

Take a good look, and convince yourself that the process has worked. Be aware that though `fix` works pretty well, it's not guaranteed to be perfect every time. You should always check the produced structure very carefully.

The protein chain looks complete, however the structure contains bits we don't want - the imatinib ligand, and also (maybe not easy to see) some water molecules.

We will use `mdtraj` to cut the model down to just the protein component, and save as a PDB format file. We'll also have a look to check:

In [None]:
abl = abl_imatinib.atom_slice(abl_imatinib.topology.select('protein'))
abl.save('abl_heavy.pdb')
view = nv.show_mdtraj(abl)
view.add_representation('licorice', 'protein')
view

If you zoom in you will be able to see there are no hydrogen atoms in this structure, just heavy atoms. The next step is to fix this.

### 2.3 Completion of the protein model - addition of hydrogen atoms.

There are a variety of methods to do this more-or-less automatically, but none is perfect - always check the results! Here we will use the CBA tool *add_h*, which in turn uses the molecular graphics package [Chimera](https://www.cgl.ucsf.edu/chimera/) (or the newer [ChimeraX](https://www.cgl.ucsf.edu/chimerax/)) and the "pdb4amber" utility from [AmberTools](https://ambermd.org/AmberTools.php).

It requires the names for the input and output PDB files, the command that will launch Chimera (or ChimeraX), and a decision as to whether names of protein residues should be adapted to fit **AMBER** conventions (we want this). In the next cell we run it, then check the result visually:

In [None]:
add_h('abl_heavy.pdb', 'abl_amber.pdb', 
      chimera='chimerax',
      mode='amber')
abl_amber = mdt.load('abl_amber.pdb')
view2 = nv.show_mdtraj(abl_amber)
view2.add_representation('licorice', 'protein')
view2

You should be able to see the structure now includes all the expected hydrogen atoms.
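
A simple numerical check can back up the visual one. The sketch below (a hypothetical helper, not part of `cba_tools`) counts atoms by element in a PDB file, so you can compare the files before and after `add_h`. It assumes the element symbol is present in columns 77-78 of each `ATOM`/`HETATM` record; atoms without it are counted under `?`.

```python
from collections import Counter
from pathlib import Path


def count_elements(pdb_lines):
    """Count atoms per element symbol, read from PDB columns 77-78."""
    counts = Counter()
    for line in pdb_lines:
        if line.startswith(("ATOM", "HETATM")):
            element = line[76:78].strip().upper()
            counts[element or "?"] += 1
    return counts


# Compare the heavy-atom model with the protonated model, if both files exist.
for name in ("abl_heavy.pdb", "abl_amber.pdb"):
    path = Path(name)
    if path.exists():
        with path.open() as handle:
            counts = count_elements(handle)
        heavy = sum(n for element, n in counts.items() if element != "H")
        print(f"{name}: {counts['H']} hydrogens, {heavy} heavy atoms")
```

You would expect no hydrogens in the first file and, for a protein, a hydrogen count of roughly the same size as the heavy-atom count in the second.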

With models for both the protein and ligand now made, the next step would most likely be to use **Docking** or a similar process to predict where in the structure of the protein the ligand binds. We are going to skip over that for now, assuming it has been done and a new model for the imatinib ligand is available: 'ligand_docked.pdb'. 

## Part 3: Preparing the protein-ligand system for MD with AMBER

In this next part we will combine the protein and docked ligand structures, generate a biologically more relevant model by immersing the protein-ligand complex in a bath of water and ions, and then generate the data files in a format required for the AMBER MD simulation package. This process is called "parameterization".

### 3.1 Import the required packages

In [None]:
from cba_tools import param

### 3.2 Merging the protein and ligand into one PDB file

First we need to merge the structures of the protein and ligand into one file. We can use `mdtraj` for this:

In [None]:
abl = mdt.load_pdb('abl_amber.pdb', standard_names=False) # Keep AMBER-compliant names
ligand = mdt.load_pdb('ligand_docked.pdb')
abl_ligand = abl.stack(ligand)
abl_ligand.save('abl_ligand.pdb')
view = nv.show_mdtraj(abl_ligand)
view

If you click on one of the ligand atoms, you will see it has the residue name 'UNL'. Remember too from part 1 that this ligand has a formal charge of +2. You will need both of these pieces of information shortly.

### 3.3 Completion and parameterization of the molecular system

Now that we have a chemically-complete model for the protein and ligand, we can move on to the parameterization stage.

#### Gathering required information

For this we will be using tools from the [AMBER MD](https://ambermd.org/) simulation package. Parameterizing the protein component of the system is easy, because AMBER comes with a library of parameters for all "standard" biomolecular components (amino acids, nucleic acids, certain ions, solvents and lipids, etc.). But it has no knowledge of the parameters required for the Imatinib analogue in our system, so we have to generate these ourselves.

The CBA tool *param* will do this for you. It requires:

 - the name of the PDB format file to process
 - the name of the AMBER parameter ("prmtop") file to generate
 - the name of the AMBER coordinates ("inpcrd") file to generate
 - the names of all non-standard residues ("heterogens") that will need to be parameterized
 - the formal charge on each of the heterogens
 - the type of solvent (water) box to add (see below)
 - the width of the solvent margin between the solute and the box boundaries

In addition, for more advanced use you can specify which forcefields you want to be used (otherwise defaults are selected automatically).
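
The heterogen charges matter because they contribute to the net charge of the whole solute, which simulation setups typically balance with counter-ions. As a rough sketch of that bookkeeping (a hypothetical helper, not how `param` works internally), the function below sums standard side-chain charges using AMBER residue names plus the heterogen charges you supply. It assumes standard protonation states as encoded in the residue names, and it ignores the chain termini on the assumption that a charged N-terminus and C-terminus cancel.

```python
from pathlib import Path

# Formal side-chain charges for AMBER residue names (others are taken as neutral).
RESIDUE_CHARGE = {"ASP": -1, "GLU": -1, "CYM": -1, "LYS": 1, "ARG": 1, "HIP": 1}


def estimate_net_charge(pdb_lines, het_charges):
    """Estimate the net charge from residue names and user-supplied heterogen charges."""
    seen = set()
    total = 0
    for line in pdb_lines:
        if not line.startswith(("ATOM", "HETATM")):
            continue
        resname = line[17:20].strip()
        key = (line[21], line[22:27], resname)  # chain, residue number (+ insertion code), name
        if key in seen:
            continue  # count each residue once, not once per atom
        seen.add(key)
        total += RESIDUE_CHARGE.get(resname, 0) + het_charges.get(resname, 0)
    return total


path = Path("abl_ligand.pdb")  # the merged protein-ligand file from section 3.2
if path.exists():
    with path.open() as handle:
        print("Estimated net charge:", estimate_net_charge(handle, {"UNL": 2}))
```

Treat the result as an estimate only; the parameterization tools compute the actual charge of the system from the assigned atomic charges.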

#### Deciding about solvation

The options for the periodic box of solvent are "box", "cube", and "oct" (truncated octahedron). The figure below summarizes the differences:

![boxes](boxes.png)

"Box" adds the least solvent to satisfy the "buffer" criterion (white arrows), but if the solute (orange) rotates in the box, it may extend beyond it. "Cube" solves this, but means adding more water (so more atoms and a slower simulation). "Oct" reduces the number of waters required but still is safe for rotation of the solute.

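To get a feel for the trade-off, the idealized geometry can be worked through numerically. The sketch below is an assumption-laden estimate, not what `param` or AMBER compute: it treats the solute as a rectangular block, takes its longest extent as a stand-in for its diameter, and uses the fact that a truncated octahedron enclosing a sphere of diameter D has volume 4/(3√3)·D³, about 77% of the cube that encloses the same sphere. The solute dimensions are made up for illustration.

```python
import math


def box_volumes(dims, buffer):
    """Idealized solvent-box volumes (cubic angstroms) for solute extents `dims` (angstroms)."""
    a, b, c = dims
    # "box": rectangular box that follows the solute shape, padded by the buffer on every side.
    rect = (a + 2 * buffer) * (b + 2 * buffer) * (c + 2 * buffer)
    # Rotation-safe shapes must enclose a sphere spanning the longest extent plus buffers.
    diameter = max(dims) + 2 * buffer
    cube = diameter ** 3
    # Truncated octahedron circumscribing a sphere of that diameter.
    oct_volume = 4.0 / (3.0 * math.sqrt(3.0)) * diameter ** 3
    return {"box": rect, "cube": cube, "oct": oct_volume}


# Assumed solute extents, for illustration only.
volumes = box_volumes((60.0, 45.0, 40.0), buffer=10.0)
for shape, volume in volumes.items():
    print(f"{shape:>4}: {volume:10.0f} cubic angstroms ({volume / volumes['cube']:.0%} of cube)")
```

Since the number of water molecules scales roughly with the box volume, this is the sense in which "oct" saves solvent relative to "cube" while still tolerating rotation of the solute.
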
#### Running the parameterization

With a decision made about this, it's time to run the parameterization process. This may take quite a long time, because part of the process may involve running a Quantum Mechanics (QM) calculation.

In [None]:
param('abl_ligand.pdb', 'abl_ligand.prmtop', 'abl_ligand.inpcrd', 
      het_names=['UNL'], het_charges=[2],
      solvate='oct', buffer=10.0)

Visualize the result, which is your 'simulation-ready' system:

In [None]:
system = mdt.load('abl_ligand.inpcrd', top='abl_ligand.prmtop')
view3 = nv.show_mdtraj(system)
view3.add_representation('line', 'water')  # 'water' keyword covers common water residue names (HOH, WAT, ...)
view3

Your system preparation process is complete!

## To recap:

1. You built a 3D model for your chosen ligand, starting from a SMILES string.
2. You built a 3D model for your protein target - Abl - by remediating a structure obtained from the Protein Data Bank (PDB).
3. You skipped over the part where the protein and ligand are "docked" together.
4. You combined protein and docked ligand, added water and ions to give a biologically relevant system, and then generated the necessary structure and parameter files (`abl_ligand.inpcrd` and `abl_ligand.prmtop`) for the MD simulation program AMBER.


You are ready to start running some MD simulations!