In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import molsysmt as msm

# Element selection

Selecting elements is probably the most frequent task when working with molecular systems. In many situations we need the list of elements that satisfy a certain condition: for instance, to calculate the contact map between the CA atoms of two chains, to remove the solvent atoms, or to count how many 'HIS' residues there are in a peptide. All these conditions can be expressed as a sentence that the queried elements must match. Each library, MD engine or molecular visualization program has its own syntax to write this sentence. You can see different examples in MDTraj, PyTraj, Amber, PyMOL or VMD.

## MolSysMT selection syntax

MolSysMT has its own selection syntax, based on the attributes of elements such as atoms, groups, molecules, etc. Let's load a molecular system to explain the logic behind this syntax:

In [3]:
file_path = msm.test_systems.files['1tcd.mmtf']

In [4]:
molecular_system = msm.convert(file_path, to_form='molsysmt.MolSys')

A molecular system encoded in the native form 'MolSys' stores its topology as a pandas DataFrame with one row per atom:

In [5]:
molecular_system.topology

| | atom.index | atom.name | atom.id | atom.type | atom.formal_charge | atom.bonded_atom_indices | group.index | group.name | group.id | group.type | ... | chain.id | chain.type | molecule.index | molecule.name | molecule.id | molecule.type | entity.index | entity.name | entity.id | entity.type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | N | 1 | N | 0.0 | [1] | 0 | LYS | 4 | aminoacid | ... | A | | 0 | TRIOSEPHOSPHATE ISOMERASE | 0 | protein | 0 | TRIOSEPHOSPHATE ISOMERASE | 0 | protein |
| 1 | 1 | CA | 2 | C | 0.0 | [0, 2, 4] | 0 | LYS | 4 | aminoacid | ... | A | | 0 | TRIOSEPHOSPHATE ISOMERASE | 0 | protein | 0 | TRIOSEPHOSPHATE ISOMERASE | 0 | protein |
| 2 | 2 | C | 3 | C | 0.0 | [1, 3, 9] | 0 | LYS | 4 | aminoacid | ... | A | | 0 | TRIOSEPHOSPHATE ISOMERASE | 0 | protein | 0 | TRIOSEPHOSPHATE ISOMERASE | 0 | protein |
| 3 | 3 | O | 4 | O | 0.0 | [2] | 0 | LYS | 4 | aminoacid | ... | A | | 0 | TRIOSEPHOSPHATE ISOMERASE | 0 | protein | 0 | TRIOSEPHOSPHATE ISOMERASE | 0 | protein |
| 4 | 4 | CB | 5 | C | 0.0 | [1, 5] | 0 | LYS | 4 | aminoacid | ... | A | | 0 | TRIOSEPHOSPHATE ISOMERASE | 0 | protein | 0 | TRIOSEPHOSPHATE ISOMERASE | 0 | protein |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3978 | 3978 | O | 3979 | O | 0.0 | [] | 657 | HOH | 339 | water | ... | D | | 161 | water | 161 | water | 1 | water | 1 | water |
| 3979 | 3979 | O | 3980 | O | 0.0 | [] | 658 | HOH | 340 | water | ... | D | | 162 | water | 162 | water | 1 | water | 1 | water |
| 3980 | 3980 | O | 3981 | O | 0.0 | [] | 659 | HOH | 341 | water | ... | D | | 163 | water | 163 | water | 1 | water | 1 | water |
| 3981 | 3981 | O | 3982 | O | 0.0 | [] | 660 | HOH | 342 | water | ... | D | | 164 | water | 164 | water | 1 | water | 1 | water |


As you can see, the column names are the fundamental attributes of the molecular system elements:

In [6]:
print(molecular_system.topology.columns)

Index(['atom.index', 'atom.name', 'atom.id', 'atom.type', 'atom.formal_charge',
       'atom.bonded_atom_indices', 'group.index', 'group.name', 'group.id',
       'group.type', 'component.index', 'component.name', 'component.id',
       'component.type', 'chain.index', 'chain.name', 'chain.id', 'chain.type',
       'molecule.index', 'molecule.name', 'molecule.id', 'molecule.type',
       'entity.index', 'entity.name', 'entity.id', 'entity.type'],
      dtype='object')


The query syntax that Pandas provides for a pandas.DataFrame is the basis of the MolSysMT selection procedure. The boolean syntax of Pandas includes the following words and symbols:

<center>

| Word | Symbol | Meaning |
|---|---|---|
| and | & | and |
| or | \| | or |
| not | ~ | not |
| in | | member of a list |
|  | == | equal to |
|  | != | not equal to |
|  | < | less than |
|  | <= | less than or equal to |
|  | > | greater than |
|  | >= | greater than or equal to |

</center>

Because the selection sentence follows the Pandas query syntax, it can also refer to external lists. Let's see some simple examples.
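As a warm-up, the operators in the table above can be tried with plain `pandas.DataFrame.query` on a toy table. This is only an illustrative sketch: the data and the underscore column names (`atom_name`, `atom_type`, `atom_id`) are assumptions made for the example, not the MolSysMT topology, and the helper `query_indices` is hypothetical.

```python
import pandas as pd

# Toy table (hypothetical data): one row per atom; the row label plays
# the role of the atom index.
toy = pd.DataFrame({
    "atom_name": ["N", "CA", "C", "O", "CB", "N", "CA"],
    "atom_type": ["N", "C", "C", "O", "C", "N", "C"],
    "atom_id": [1, 2, 3, 4, 5, 6, 7],
})

def query_indices(sentence):
    """Return the row labels that match a pandas query sentence."""
    return toy.query(sentence).index.tolist()

named_ca = query_indices('atom_name == "CA"')
ca_or_cb = query_indices('atom_name in ["CA", "CB"]')
# Comparing a column with a list using == behaves like `in`.
c_or_n = query_indices('atom_type == ["C", "N"]')
carbon_not_ca = query_indices('atom_type == "C" and atom_name != "CA"')
id_window = query_indices('atom_id < 5 and atom_id >= 3')

print(named_ca, ca_or_cb, c_or_n, carbon_not_ca, id_window)
```

The same kinds of sentences appear in the MolSysMT examples of this section, written with the dotted attribute names of the topology.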

### Simple atom selection by attributes or properties
The following are some examples where a list of atoms is obtained matching some selection criteria:

In [7]:
# Atoms with name C
msm.select(molecular_system, 'atom.name == "C"')

array([   2,   11,   18, ..., 3798, 3803, 3810])

In [8]:
# Atoms with name CA or CB
msm.select(molecular_system, 'atom.name in ["CA","CB"]')

array([   1,    4,   10, ..., 3805, 3809, 3812])

In [9]:
# Atoms of type C or N (== against a list behaves like `in`)
msm.select(molecular_system, 'atom.type==["C","N"]')

array([   0,    1,    2, ..., 3814, 3815, 3816])

In [10]:
# Heavy atoms
msm.select(molecular_system, 'not atom.type=="H"')

array([   0,    1,    2, ..., 3980, 3981, 3982])

In [11]:
# Atoms of type C not named CA
msm.select(molecular_system, 'atom.type=="C" and not atom.name=="CA"')

array([   2,    4,    5, ..., 3813, 3814, 3815])

In [12]:
# Atoms not named CA, CB or C (!= against a list behaves like `not in`)
msm.select(molecular_system, 'atom.name!=["CA","CB","C"]')

array([   0,    3,    5, ..., 3980, 3981, 3982])

In [13]:
# Atoms with id number lower than 10
msm.select(molecular_system, 'atom.id<10')

array([0, 1, 2, 3, 4, 5, 6, 7, 8])

In [14]:
# Atoms with id number lower than 10 and greater than or equal to 3
msm.select(molecular_system, 'atom.id<10 and atom.id>=3')

array([2, 3, 4, 5, 6, 7, 8])

### Including attributes of other elements

Atoms can also be selected using the attributes of the other elements in the hierarchical organization of the molecular system: 'group', 'component', 'molecule', 'chain', 'entity' or 'bioassembly'. You can find further information on these elements in XXX. These are some examples of selection sentences that include criteria other than atom attributes:

In [15]:
# Atoms belonging to molecules of type water.
msm.select(molecular_system, 'molecule.type=="water"')

array([3818, 3819, 3820, ..., 3980, 3981, 3982])

In [16]:
# Heavy atoms belonging to molecules of type protein.
msm.select(molecular_system, 'molecule.type=="protein" and atom.type!="H"')

array([   0,    1,    2, ..., 3815, 3816, 3817])

In [17]:
# Atoms belonging to residues named GLY, ALA or VAL in chain named A.
msm.select(molecular_system, 'group.name==["GLY","ALA","VAL"] and chain.name=="A"') 

array([  40,   41,   42, ..., 1886, 1887, 1888])

### Including external variables

The Pandas query method allows the use of external variables in the logical sentence. To include them, the variable names must be preceded by the character '@'. Let's illustrate this with some examples:

In [18]:
# Atoms in groups with indices 10, 11 or 12.
indices=[10,11,12]
msm.select(molecular_system, 'group.index==@indices')

array([77, 78, 79, ..., 97, 98, 99])

In [19]:
# Atoms named CA, C, O or N with atom indices 10 to 29.
indices=list(range(10,30))
atoms=["CA", "C", "O", "N"]
msm.select(molecular_system, 'atom.name==@atoms & atom.index==@indices') 

array([10, 11, 12, ..., 26, 27, 28])
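The '@' mechanism comes from Pandas itself, so it can be tried on a toy table without MolSysMT. The sketch below is a minimal illustration under assumed data: the table, the underscore column names and the variable names `backbone_names` and `wanted_groups` are all hypothetical.

```python
import pandas as pd

# Toy table (hypothetical data): two groups with four atoms each.
toy = pd.DataFrame({
    "atom_name": ["N", "CA", "C", "O", "N", "CA", "C", "O"],
    "group_index": [10, 10, 10, 10, 11, 11, 11, 11],
})

# External Python variables referenced from the query sentence.
backbone_names = ["CA", "C"]
wanted_groups = [11]

# '@name' makes pandas look `name` up in the calling Python scope.
selected = toy.query(
    "atom_name in @backbone_names and group_index == @wanted_groups"
).index.tolist()

print(selected)
```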

### Including mask filters

Although including masks is not really necessary, `molsysmt.select()` has an optional input argument to do so:

In [20]:
# Atoms named C with atom index in range 10 to 29
indices=list(range(10,30))
msm.select(molecular_system, 'atom.name=="C"', mask=indices)

array([11, 18, 27])

The use of a mask can always be avoided by extending the logical sentence:

In [21]:
# Atoms named C with atom index in range 10 to 29
indices=list(range(10,30))
msm.select(molecular_system, 'atom.name=="C" and atom.index in @indices')

array([11, 18, 27])
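The equivalence between the two cells above suggests that a mask acts as an intersection with the query result. A minimal sketch of that idea follows, assuming this is the intended semantics; `apply_mask` is a hypothetical helper and the index values are toy numbers, not taken from the 1tcd system.

```python
def apply_mask(selected, mask):
    """Keep only the selected indices that also appear in the mask.

    Hypothetical helper: it mimics the observed effect of the `mask`
    argument, not the actual MolSysMT implementation.
    """
    allowed = set(mask)
    return [index for index in selected if index in allowed]

# Toy query result and mask (illustrative values only).
query_result = [3, 12, 21, 30, 39]
mask = list(range(10, 31))

masked = apply_mask(query_result, mask)
print(masked)
```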

### Selection of other elements

The selection method of MolSysMT can also return the indices of elements other than atoms. Like many methods in this library, `molsysmt.select()` has an input argument named `target` that sets the kind of element the output indices refer to. Let's see some examples:

In [22]:
# Groups with indices equal to 0, 100 or 200
indices=[0,100,200]
msm.select(molecular_system, 'group.index==@indices', target='group')

array([  0, 100, 200])

In [23]:
# Groups with name "ALA"
msm.select(molecular_system, 'group.name=="ALA"', target='group')

array([  5,   6,   7, ..., 465, 482, 494])

In [24]:
# Groups containing the atoms with index 34, 44 or 64
msm.select(molecular_system, 'atom.index==[34,44,64]', target='group')

array([4, 5, 9])

In [25]:
# Groups belonging to chain named A and molecule of type anything but water
msm.select(molecular_system, 'chain.name=="A" and molecule.type!="water"', target='group')

array([  0,   1,   2, ..., 245, 246, 247])

In [26]:
# Groups of molecules of type water
msm.select(molecular_system, 'molecule.type=="water"', target='group')

array([497, 498, 499, ..., 659, 660, 661])

In [27]:
# Molecules of type water
msm.select(molecular_system, 'molecule.type=="water"', target='molecule')

array([  1,   2,   3, ..., 163, 164, 165])

In [28]:
# Chains with molecules of type water
msm.select(molecular_system, 'molecule.type=="water"', target='chain')

array([2, 3])
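One way to picture the `target` argument is as a two-step process: the sentence is evaluated atom by atom, and the matching atoms are then mapped to the unique indices of the elements that contain them. The sketch below illustrates that reading with a toy lookup; `to_target` and `group_of_atom` are hypothetical names, and this is an assumption about the behaviour, not the MolSysMT implementation.

```python
# Toy lookup (hypothetical data): position = atom index, value = index
# of the group that contains the atom.
group_of_atom = [0, 0, 0, 1, 1, 2, 2, 2]

def to_target(atom_indices, target_of_atom):
    """Map atom indices to the sorted, unique indices of their target elements."""
    return sorted({target_of_atom[index] for index in atom_indices})

# Atoms 1 and 2 share group 0, atom 6 belongs to group 2.
groups = to_target([1, 2, 6], group_of_atom)
print(groups)
```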

Finally, notice that `mask` always acts on the targeted elements:

In [29]:
# Atoms with index 0 to 4, restricted by the mask to atom indices 0 to 2
msm.select(molecular_system, 'atom.index in [0,1,2,3,4]', mask=[0,1,2], target='atom')

array([0, 1, 2])

In [30]:
# Groups with index 0 to 4, restricted by the mask to group indices 0 to 2
msm.select(molecular_system, 'group.index in [0,1,2,3,4]', mask=[0,1,2], target='group')

array([0, 1, 2])

In [31]:
# Molecules with index 0 to 4, restricted by the mask to molecule indices 0 to 2
msm.select(molecular_system, 'molecule.index in [0,1,2,3,4]', mask=[0,1,2], target='molecule')

array([0, 1, 2])

## Syntax translation

MolSysMT is designed to interact easily with other tools. The main goal of this library is to provide a set of pipes and joins to set up your workflows, keeping the integration of other tools simple. But different tools have different selection syntaxes. Learning the selection syntax of MDTraj, ParmEd or NGLview is very useful, since these are tools we all use frequently in our labs, but the rules of each tool are soon forgotten. To keep a single selection syntax in your projects, MolSysMT includes the input argument `to_syntaxis` in the method `molsysmt.select()`. Let's see some examples:

In [38]:
msm.select(molecular_system, selection='group.index==[3,4,5]', to_syntaxis='NGLview')

'@25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44'

In [41]:
msm.select(molecular_system, selection='group.index==[3,4,5]', to_syntaxis='MDTraj')

'index 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44'

When the selection targets other elements, such as groups or chains, the output string is written as a sequence of those elements:

In [39]:
msm.select(molecular_system, target='group', selection='group.index==[3,4,5]', to_syntaxis='NGLview')

'7:A 8:A 9:A'

In [42]:
msm.select(molecular_system, target='group', selection='group.index==[3,4,5]', to_syntaxis='MDTraj')

'resid 3 4 5'
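The translated strings above have a simple shape: a keyword or prefix followed by the list of indices. The sketch below rebuilds three of them from plain index lists. The helper names are hypothetical and the string formats are copied from the outputs shown above; this is not how MolSysMT implements the translation.

```python
def atoms_to_nglview(atom_indices):
    """Build an NGLview-style atom selection such as '@25,26,27'."""
    return "@" + ",".join(str(index) for index in atom_indices)

def atoms_to_mdtraj(atom_indices):
    """Build an MDTraj-style atom selection such as 'index 25 26 27'."""
    return "index " + " ".join(str(index) for index in atom_indices)

def groups_to_mdtraj(group_indices):
    """Build an MDTraj-style residue selection such as 'resid 3 4 5'."""
    return "resid " + " ".join(str(index) for index in group_indices)

atom_indices = list(range(25, 45))
nglview_atoms = atoms_to_nglview(atom_indices)
mdtraj_atoms = atoms_to_mdtraj(atom_indices)
mdtraj_groups = groups_to_mdtraj([3, 4, 5])

print(nglview_atoms)
print(mdtraj_atoms)
print(mdtraj_groups)
```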

### Supported output syntaxes

MolSysMT translates selection sentences from its native syntax to NGLview, MDTraj, Pytraj, ParmEd and AMBER.

## Using your favourite selection syntax

To be implemented.