Protein selection for docking
Often you might have multiple crystal structures for a protein, each with different ligands present. This introduces an extra complexity when docking because you need to work out which one(s) are going to give you the best results. This is common in fragment screening where you might have multiple fragments crystalised into the same target protein. Those fragments may explore the extent of the cavity well, but if you are wanting to use docking to screen potential derivatives of those fragments then you need to determine what protein structure(s) you are going to use.
If you are lucky one protein may be good for docking a wide range of ligands. If you are unlucky no single protein may be good for all and you may need to use multiple proteins.
The approach taken here is to try to re-dock all of the ligands into all of the proteins and see how well each protein docks each ligand, based on RMSD values of the docked ligand and the coordinates of that ligand in its own crystal structure. Hopefully the protein which contained the ligand is a good receptor for that ligand, but is it also a good receptor for the other ligands.
RDock is used for the docking.
The data is a small subset of data from fragment screening of NUDT7A performed by the SGC at the Diamond Light Source's XChem project. More info on that target at the SGC can be found here.The 5 structures were selected and prepared by Anthony Bradley. The ligands were removed to generate apo structures and all waters were removed. Proteins are in PDB format, ligands in MOLFILE format. The proteins are already aligned and there is only small variation on the protein backbone between the different structures.
The ligands for NUDT7A-x0129, NUDT7A-x0254, NUDT7A-x0384 and NUDT7A-x0389 occupy a very similar region in the cavity whilst NUDT7A-x0151 occupies a different region with no real overlap.
1. copy files
Files are copied from the datasets section.
$ ./1_copy_files.sh Copying files ... Files copied. Files are: NUDT7A-x0129_apo.pdb NUDT7A-x0151_apo.pdb NUDT7A-x0254_apo.pdb NUDT7A-x0384_apo.pdb NUDT7A-x0389_apo.pdb NUDT7A-x0129.mol NUDT7A-x0151.mol NUDT7A-x0254.mol NUDT7A-x0384.mol NUDT7A-x0389.mol Creating SDF with all ligands ... ligands.sdf created Done
2. create cavity
How to define the cavity for each protein is a good question. As fragments are small each one only explores a relatively small part of the potential cavity. Any chemistry follow up is going to want to extend the fragment into this larger space. Hence restricting the cavity to the area around the ligand for that protein may be too restrictive.
RDock provides two ways of defining the cavity. The easier one is to use an existing ligand to define the region of the cavity. But doing this has the problems described above.
The second is to use the '2 sphere' approach to map out the cavity. That might be a good approach but here we're wanting to exploit the ligand information.
So which ligand to use? Why not use them all as that will nicely map out the region. The rDock rbcavity program only accepts a single ligand so we make this possible by creating a 'Frankenstein' ligand by combining the atoms of all ligands into a single molecule. This is done with a Perl script provided by Xavier Barril who has previously used this technique. The Frankenstein molecule has only atoms, and no bonds, and 'internal' atoms that do not contribute to the external surface are excluded. With this ligand we can create a cavity, and as the combined ligands should explore the cavity reasonably efficiently we can use a relatively small radius of 3A around those atoms. See the parameter file for details.
The cavity definition is generated for each protein with the 2_create_cavity.sh script which does these actions:
- Creates the actual parameters file for each protein from a template
- Prepares the protein in MOL2 format using OpenBabel. Note that this is quite a simplistic approach and we plan to come back to this later.
- Creates the cavity definition using rbcavity and the Frankenstein molecule.
Each protein is handled in its own directory.
$ ./2_create_cavity.sh Creating Frankenstein ligand ... V2000 V2000 V2000 V2000 V2000 Processing NUDT7A-x0129_apo.pdb ... Creating NUDT7A-x0129.mol2 1 molecule converted Creating cavity for NUDT7A-x0129 Processing NUDT7A-x0151_apo.pdb ... Creating NUDT7A-x0151.mol2 1 molecule converted Creating cavity for NUDT7A-x0151 Processing NUDT7A-x0254_apo.pdb ... Creating NUDT7A-x0254.mol2 1 molecule converted Creating cavity for NUDT7A-x0254 Processing NUDT7A-x0384_apo.pdb ... Creating NUDT7A-x0384.mol2 1 molecule converted Creating cavity for NUDT7A-x0384 Processing NUDT7A-x0389_apo.pdb ... Creating NUDT7A-x0389.mol2 1 molecule converted Creating cavity for NUDT7A-x0389
Docking is done using rDock. As the datasets are small we are not at this stage parallelising this.
Each of the 5 ligands is docked into each of the 5 proteins.
For each ligand-protein combination 50 dockings are performed generating a file named
and the results sorted by the docking score to create the file
Then RMSD values are calculated against the pose of the same ligand in its own crystal structure.
Then the lowset RMSD for each molecule in that target and its rank in the scoring is extracted in the file
Run this with the command:
RMSD Scores are as follows
A number of interesting conclusions can be made.
Firstly rDock seems to be good at generating the correct pose, but very often that pose is not the best score. Very often the best RMSD is ranked quite lowly. We do need those 50 dockings to be sure of getting the possibility of a good pose.
This is seen for NUDT7A-x0389 where the ligand is consistently docked into the same space but two ways round, with the wrong way scoring significantly better than the right way. In the figure below the ball and stick representation is the ligand in the crystal structure and the two licorice representation are the best scoring docking pose and the pose with the best RMSD. The one with the best RMSD ranks 41st in the scoring list! The best scoring pose overlays very well but is the wrong way round.
The overall picture for the best RMSD scores looks like this (rows are ligands, columns are proteins).
The ranks look like this:
Clearly the 'best' pose as determined by RMSD to the crystal structure ligand often ranks quite low. But despite this rDock is able to find the 'correct' pose in all cases.
Clearly we should not trust the docking scores and need to look at alternative ways to score the docked posed.
Clearly we need to generate a reasonably large number of poses and not rely on the best scoring ones.
Overall the NUDT7A-x0129 protein seems the best target to use as the sum of the RMSD scores is lowest, but it does not handle 2 of the ligands, NUDT7A-x0151 and NUDT7A-x0384, well. The proteins for those two ligands do handle their own ligand well, but do not handle any of the other ligands well.
So if you wanted to choose only one taget to use you would select NUDT7A-x0129. If you wanted to be sure you had better coverage then you would also need to include NUDT7A-x0151 and NUDT7A-x0384.
Investigate better ways to prepare the protein.
Investigate the two spheres approach for cavity generation.
Investigate other docking algorithms.
Investigate clustering the docking poses and keeping the top scoring pose in each cluster.
Examine these docking poses with respect to the known hydrogen bonding and other interactions.
Investigate whether keeping specific water molecules in the protein structure can improve results.
Extend to other datasets.