# Playing with Molecule objects in HTMD

Assuming that you have already downloaded and installed HTMD, this tutorial guides you through its basic features.

After importing, any object and function defined by htmd is available in the workspace. In this tutorial, we'll look more carefully into the Molecule class features.

The `htmd.config` call sets the viewer to NGL, a WebGL molecule viewer.

In [1]:
from htmd import *
htmd.config(viewer='ngl')

Videos from the HTMD2015 workshops are available on the Acellera youtube channel: https://www.youtube.com/user/acelleralive

You are on the latest HTMD version (1.0.11).


## Molecule objects

First, create a molecule object by either:
* Fetching it from the Protein Data Bank, by using its PDB code...
* ...or from a local file (many formats supported, including pdb, mol2, xtc, psf, prmtop)

In [2]:
mol = Molecule('3PTB')

## Inspect your molecule

Printing the object shows its properties.

In [3]:
print(mol)

Molecule with 1701 atoms and 1 frames
PDB field - altloc shape: (1701,)
PDB field - beta shape: (1701,)
PDB field - chain shape: (1701,)
PDB field - charge shape: (1701,)
PDB field - coords shape: (1701, 3, 1)
PDB field - element shape: (1701,)
PDB field - insertion shape: (1701,)
PDB field - name shape: (1701,)
PDB field - occupancy shape: (1701,)
PDB field - record shape: (1701,)
PDB field - resid shape: (1701,)
PDB field - resname shape: (1701,)
PDB field - segid shape: (1701,)
PDB field - serial shape: (1701,)
bonds shape: (42, 2)
box shape: (3, 1)
fileloc shape: (1, 2)
frame: 0
masses shape: (1701,)
reps: 
ssbonds shape: (0,)
step shape: (0,)
time shape: (0,)
topoloc: /data/joao/maindisk/WORKBENCH/htmd_tutorial/playing_with_htmd_molecule/3PTB
viewname: 3PTB


## Methods and properties of Molecule objects

Each `Molecule` object has a number of methods (operations that you can perform on the molecule) and properties (data associated with the molecule). Some of the properties correspond to data usually found in PDB files.

|Methods | Properties |
|--------|------------|
|read()  |record|
|write() |serial|
|get()   |name|
|set()   |resid|
|atomselect()|chain|
|filter()|coords|
|remove()|box|
|insert()|reps|
|view()  |...|
|wrap()  | |
|align() | |

Properties can be accessed either directly, or via the `Molecule.get` method. For example:

In [4]:
mol.serial

array([   1,    2,    3, ..., 1700, 1701, 1702])

In [5]:
mol.get("serial")

array([   1,    2,    3, ..., 1700, 1701, 1702])

Similarly, they can be modified directly, or via the `Molecule.set` method. Such a pair of methods is known as "getter/setter" methods in object-oriented jargon.
The following sections will show the usage of property getters and setters in a number of real-world tasks.
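
As a rough illustration of what a direct modification looks like, the sketch below uses plain numpy arrays as stand-ins for per-atom properties. The data and the chain label `'L'` are made up for the example, and no HTMD call is involved.

```python
import numpy as np

# Hypothetical per-atom arrays standing in for mol.resname and mol.chain.
resname = np.array(['GLY', 'BEN', 'BEN', 'HOH'], dtype=object)
chain = np.array(['A', 'A', 'A', 'A'], dtype=object)

# Direct modification: assign in place through a boolean mask,
# so only the atoms matching the condition are changed.
chain[resname == 'BEN'] = 'L'

print(chain)
```

Only the two entries whose residue name matches are changed; all other atoms keep their original chain.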

### Check the residue IDs of the cysteine residues present in your protein

To get the residue IDs of the cysteine residues in the molecule, one can do:

In [6]:
mol.get('resid',sel='resname CYS')

array([ 22,  22,  22,  22,  22,  22,  42,  42,  42,  42,  42,  42,  58,
        58,  58,  58,  58,  58, 128, 128, 128, 128, 128, 128, 136, 136,
       136, 136, 136, 136, 157, 157, 157, 157, 157, 157, 168, 168, 168,
       168, 168, 168, 182, 182, 182, 182, 182, 182, 191, 191, 191, 191,
       191, 191, 201, 201, 201, 201, 201, 201, 220, 220, 220, 220, 220,
       220, 232, 232, 232, 232, 232, 232])

Note how each residue ID is repeated several times. This is because one value is returned per matched atom, and this PDB file has 6 atoms resolved per cysteine residue.

The atom names of cysteine residue 58 can be checked with:

In [7]:
mol.get('name','resname CYS and resid 58')

array(['N', 'CA', 'C', 'O', 'CB', 'SG'], dtype=object)

To obtain one residue ID per residue, one can either further restrict the selection to the α carbon (`name CA`) atoms:

In [8]:
mol.get('resid',sel='name CA and resname CYS')

array([ 22,  42,  58, 128, 136, 157, 168, 182, 191, 201, 220, 232])

or use numpy's `unique` function to remove repeated entries:

In [9]:
np.unique(mol.get('resid',sel='resname CYS'))

array([ 22,  42,  58, 128, 136, 157, 168, 182, 191, 201, 220, 232])
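
The difference between per-atom and per-residue results can be reproduced without HTMD. The following is a minimal sketch on made-up per-atom arrays (not read from 3PTB), using numpy boolean masks as a stand-in for the selection strings.

```python
import numpy as np

# Toy per-atom arrays (hypothetical data): two cysteines with three atoms
# each, plus a single glycine atom in between.
resname = np.array(['CYS', 'CYS', 'CYS', 'GLY', 'CYS', 'CYS', 'CYS'], dtype=object)
resid = np.array([22, 22, 22, 23, 42, 42, 42])
name = np.array(['N', 'CA', 'SG', 'CA', 'N', 'CA', 'SG'], dtype=object)

# One value per matched atom, so each residue ID repeats.
per_atom = resid[resname == 'CYS']

# Option 1: keep a single atom per residue (the alpha carbon).
per_residue_ca = resid[(resname == 'CYS') & (name == 'CA')]

# Option 2: collapse repeats afterwards (np.unique also sorts the result).
per_residue_unique = np.unique(per_atom)

print(per_atom)
print(per_residue_ca)
print(per_residue_unique)
```

The two options can differ when two residues share a numeric ID: the CA route returns that ID twice, while `np.unique` merges the two entries into one.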

### Retrieve the coordinates of atoms

Coordinates are stored in the `Molecule.coords` property. For a single selected atom, `get` returns a vector with three elements (x, y, z). The values are stored as single-precision floats (`float32`), so the three decimal places of the PDB file show up with rounding noise when printed.

In [10]:
mol.get('coords','resname CYS and resid 58 and name CA')

array([  4.23999977,  16.49500084,  27.98600006], dtype=float32)

What is returned if more than one atom is selected? A matrix with one row per atom.

In [11]:
mol.get('coords','resname CYS and resid 58')

array([[  5.12200022,  16.71899986,  26.86300087],
       [  4.23999977,  16.49500084,  27.98600006],
       [  4.87400007,  16.95800018,  29.29999924],
       [  4.23799992,  16.76399994,  30.36199951],
       [  3.94099998,  14.9989996 ,  28.07099915],
       [  2.79200006,  14.45199966,  26.72200012]], dtype=float32)
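
The vector-versus-matrix behaviour matches what plain numpy indexing gives on an array laid out like the `coords shape: (1701, 3, 1)` line printed earlier, that is (atoms, xyz, frames). The sketch below uses made-up coordinates and boolean masks; it only mimics the shapes and is not HTMD's own code.

```python
import numpy as np

# Hypothetical coordinates for 4 atoms and 1 frame: (atoms, xyz, frames).
coords = np.arange(12, dtype=np.float32).reshape(4, 3, 1)

frame = 0
frame_xyz = coords[:, :, frame]          # (atoms, 3) slice for one frame

# Several atoms selected: one row per atom.
mask_many = np.array([False, True, True, False])
many = frame_xyz[mask_many]

# A single atom selected: squeezing drops the length-1 axis.
mask_one = np.array([False, True, False, False])
one = np.squeeze(frame_xyz[mask_one])

print(many.shape, one.shape)
```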

### Display the chains or segments present in your PDB file

The chains present in the `Molecule` can be listed with:

In [12]:
np.unique(mol.get('chain'))

array(['A'], dtype=object)

which means that every atom is assigned to the same chain.

### List atoms recognized as water

Get the serial numbers of the atoms that were recognized as water:

In [13]:
mol.get("serial",sel="water")

array([1641, 1642, 1643, 1644, 1645, 1646, 1647, 1648, 1649, 1650, 1651,
       1652, 1653, 1654, 1655, 1656, 1657, 1658, 1659, 1660, 1661, 1662,
       1663, 1664, 1665, 1666, 1667, 1668, 1669, 1670, 1671, 1672, 1673,
       1674, 1675, 1676, 1677, 1678, 1679, 1680, 1681, 1682, 1683, 1684,
       1685, 1686, 1687, 1688, 1689, 1690, 1691, 1692, 1693, 1694, 1695,
       1696, 1697, 1698, 1699, 1700, 1701, 1702])

Note that this structure contains no water hydrogens, so each water contributes exactly one serial number and `np.unique` is not needed.

The number of waters present in the `Molecule` can be obtained with:

In [14]:
len(mol.get("serial",sel="water"))

62

## Create selections of atoms

The `Molecule.atomselect` method returns a vector of boolean values:

In [15]:
mol.atomselect("water")

array([False, False, False, ...,  True,  True,  True], dtype=bool)

The fact that `True` counts as 1 in the `sum` function can be used to obtain, through a different method, the number of waters:

In [16]:
selection = mol.atomselect("water")
print(selection)
sum(selection)

[False False False ...,  True  True  True]


62
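
A boolean mask of this kind can be counted and converted to positions with standard numpy functions. The sketch below uses a short made-up mask in place of the result of `mol.atomselect("water")`.

```python
import numpy as np

# Hypothetical mask standing in for the output of mol.atomselect("water").
selection = np.array([False, False, True, True, True])

n_builtin = sum(selection)                    # Python's sum: True counts as 1
n_numpy = int(np.count_nonzero(selection))    # same count, computed by numpy
indices = np.flatnonzero(selection)           # zero-based positions of True entries

print(n_builtin, n_numpy, indices)
```

The positions returned by `np.flatnonzero` are zero-based array indices, which are not the same thing as the PDB serial numbers listed earlier.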

## Representations and Visualization

`Molecule` objects can be visualized either in VMD or in NGL, a WebGL JavaScript molecule viewer that is integrated in the Notebook (see above for viewer configuration).

In [17]:
mol = Molecule('3PTB')
mol.view()

It is possible to apply multiple representations to a `Molecule` as in VMD. Representations use the same names as in VMD, even when using the NGL viewer. Important parameters are: **style**, **color**, and **sel**.   

There are two ways of applying representations.

### The "quick" or "transient" view

Use the `Molecule.view` method, passing the representation as arguments. Set `hold=True` so that subsequent `Molecule.view` calls overlay their representations; otherwise, representations are cleared on every call.

In [18]:
mol.view(sel='protein', style='NewCartoon', color='Index', hold=True)
mol.view(sel='resname BEN', style='Licorice', color=1)

### The "explicit" way, for which representations are added to `Molecule.reps`

Here, representations are added to the `reps` property of the `Molecule` object, where they are stored, and a plain `mol.view()` call displays them.

In [19]:
mol.reps.remove()   # Clear representations
mol.reps.add(sel='protein', style='NewCartoon', color='Index')
mol.reps.add(sel='resname BEN', style='Licorice', color=1)
print(mol.reps)     # Show list of representations (equivalent to mol.reps.list())
mol.view()

rep 0: sel='protein', style='NewCartoon', color='Index'
rep 1: sel='resname BEN', style='Licorice', color='1'



## Atom selection expressions work as in VMD

The following shows the molecule without a 6 Å thick slab ($-3\ \text{Å} \le x \le +3\ \text{Å}$): the expression `x*x>9` keeps only atoms with $|x| > 3$ Å.

In [20]:
mol.reps.remove() # remove the previously stored representations
mol.view(sel='x*x>9')
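
The arithmetic behind the `x*x>9` expression can be checked on a few values. This is a small numpy sketch with made-up x coordinates (in Å); it mirrors the comparison only and does not call the atom selection engine.

```python
import numpy as np

# Hypothetical x coordinates in Å, including the slab boundaries at -3 and +3.
x = np.array([-5.0, -3.0, -1.5, 0.0, 2.9, 3.0, 4.2])

keep_squared = x * x > 9        # the form used in the selection string
keep_abs = np.abs(x) > 3        # equivalent formulation: |x| > 3

print(keep_squared)
print(np.array_equal(keep_squared, keep_abs))
```

Atoms exactly on the boundary (x = ±3) fall inside the slab and are therefore not shown, because the comparison is strict.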

## Working with trajectories

`Molecule` provides wrapping and aligning functionality for working with MD trajectories and improving their visualization.

In [21]:
# molTraj = Molecule('data/filtered.pdb')
# molTraj.read('data/traj.xtc')
# molTraj.view()

## A realistic case study

In [22]:
# Load the 'clean' molecule once again
mol = Molecule('3PTB')

In [23]:
# Identify residues in contact with the ligand BEN
mol.get("resid", sel="name CA and same residue as protein within 4 of resname BEN")

array([189, 190, 191, 192, 195, 213, 215, 216, 219, 220, 226])

Identify duplicate residues, based on PDB's insertion attribute:

In [24]:
# The quick way
np.unique(mol.get('resid', sel='insertion A'))

array([184, 188, 221])

In [25]:
# Same operation, more explicit steps and pretty-print
ia = mol.copy()
ia.filter("insertion A and name CA")
rid = ia.get('resid') # ia.resid also works!
rn = ia.get('resname')

for f, b in zip(rn, rid):
    print(f, b)

2016-04-11 20:18:41,420 - htmd.molecule.molecule - INFO - Removed 1698 atoms. 3 atoms remaining in the molecule.
GLY 184
GLY 188
ALA 221


In [26]:
# Or, if one doesn't want to rely on the attribute
dups = mol.copy()
dups.filter("name CA and protein")
rid = dups.get('resid')

nrid, count= np.unique(rid,return_counts=True)
nrid[count>1]

2016-04-11 20:18:41,458 - htmd.molecule.molecule - INFO - Removed 1478 atoms. 223 atoms remaining in the molecule.


array([184, 188, 221])

In [27]:
count

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 2, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
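
The `return_counts` idiom is easier to follow on a short array. The sketch below applies it to a made-up list of residue IDs in which two IDs appear twice.

```python
import numpy as np

# Hypothetical residue IDs, one per CA atom; 184 and 188 each appear twice.
rid = np.array([183, 184, 184, 185, 186, 187, 188, 188, 189])

# values holds the sorted distinct IDs, counts how often each one occurs.
values, counts = np.unique(rid, return_counts=True)

# Boolean indexing keeps only the IDs seen more than once.
duplicated = values[counts > 1]

print(duplicated)
```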

In [28]:
# Check whether there are "(numeric) holes" in the sequence using residue IDs
ch = mol.copy()
ch.filter("name CA and protein")
rid = ch.get('resid')
rn = ch.get('resname')

# array with the "holes" - 0 means duplicate residues; >1 means segments of protein missing
deltas = np.diff(rid)
print(deltas)

# print residue IDs at which the holes (including duplicate residues) occur
new_rid = rid[:-1] # deltas has one entry fewer than rid: the last residue has no successor
new_rid[deltas!=1]

2016-04-11 20:18:41,508 - htmd.molecule.molecule - INFO - Removed 1478 atoms. 223 atoms remaining in the molecule.
[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 2
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 5 1 1 1 1 1 1 1 1 2 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]


array([ 34,  67, 125, 130, 184, 188, 204, 217, 221])

In [29]:
# Pretty-print, more explicit

# Iterate over all residues (excluding last one)
for i in range(np.size(rid)-1):
    # If there is a break...
    if deltas[i] > 1:
        # Remember that deltas[i]=rid[i+1]-rid[i]
        print(rid[i],rn[i],' followed by ',rid[i+1],rn[i+1])

34 ASN  followed by  37 SER
67 LEU  followed by  69 GLY
125 THR  followed by  127 SER
130 SER  followed by  132 ALA
204 LYS  followed by  209 LEU
217 SER  followed by  219 GLY
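
The same gap search can also be written without an explicit index loop. This is a minimal sketch on made-up residue IDs, separating true gaps (difference above 1) from repeated IDs (difference of 0).

```python
import numpy as np

# Hypothetical residue IDs, one per CA atom: a gap after 34, and 39 repeated.
rid = np.array([33, 34, 37, 38, 39, 39, 40])

deltas = np.diff(rid)           # deltas[i] = rid[i+1] - rid[i]

# Pairs of residue IDs on either side of each gap.
gap_pairs = [(int(rid[i]), int(rid[i + 1])) for i in np.flatnonzero(deltas > 1)]

# Residue IDs that are immediately followed by the same ID.
dup_ids = rid[:-1][deltas == 0].tolist()

print(gap_pairs)
print(dup_ids)
```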
