# Adaptive sampling

This tutorial shows how to run adaptive sampling simulations on a molecular system, using the NTL9 protein as the example.

Let's start by importing HTMD:

In [1]:
from htmd.ui import *

Please cite HTMD: Doerr et al.(2016)JCTC,12,1845. https://dx.doi.org/10.1021/acs.jctc.6b00049
HTMD Documentation at: https://software.acellera.com/htmd/

You are on the latest HTMD version (2.3.28+5.g9dbc5091a.dirty).



## Get the generators folder structure

Get the data for this tutorial [here](http://pub.htmd.org/tutorials/adaptive-sampling/generators.tar.gz). Alternatively, you can download the data using `wget`:

In [2]:
import os, glob
assert os.system('wget -rcN -np -nH -q --cut-dirs=2 -R index.html* http://pub.htmd.org/tutorials/adaptive-sampling/generators/') == 0
for file in glob.glob('./generators/*/run.sh'):
    os.chmod(file, 0o755)

In [3]:
!tree generators | head -20

generators
├── ntl9_1ns_0
│   ├── input
│   ├── input.coor
│   ├── input.xsc
│   ├── parameters
│   ├── run.sh
│   ├── structure.pdb
│   └── structure.psf
├── ntl9_1ns_1
│   ├── input
│   ├── input.coor
│   ├── input.xsc
│   ├── parameters
│   ├── run.sh
│   ├── structure.pdb
│   └── structure.psf
└── ntl9_1ns_2
    ├── input
    ├── input.coor
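
Before launching anything, it can help to confirm that every generator folder is complete and that its `run.sh` is executable. The following is a minimal sketch using only the standard library; the list of expected file names is taken from the listing above, and `check_generators` is a hypothetical helper, not part of HTMD.

```python
import stat
from pathlib import Path

# File names taken from the tree listing; adjust if your generators differ.
EXPECTED = ("input", "input.coor", "input.xsc", "parameters",
            "run.sh", "structure.pdb", "structure.psf")

def check_generators(root):
    """Return {generator_name: [problems]} for every subfolder of `root`."""
    report = {}
    for gen in sorted(Path(root).iterdir()):
        if not gen.is_dir():
            continue
        # Record every expected entry that is absent
        problems = [f"missing {name}" for name in EXPECTED
                    if not (gen / name).exists()]
        # The run script needs the owner-execute bit set
        run_sh = gen / "run.sh"
        if run_sh.exists() and not run_sh.stat().st_mode & stat.S_IXUSR:
            problems.append("run.sh is not executable")
        report[gen.name] = problems
    return report
```

Calling `check_generators('./generators')` should return an empty list for each generator when the download and the `chmod` step succeeded.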


## Adaptive classes

HTMD has two types of adaptive sampling:

* AdaptiveMD (free exploration)
* AdaptiveGoal (exploration + exploitation)

Create a directory for each type of adaptive and copy the generators into them:

In [4]:
import shutil  # explicit import; `from htmd.ui import *` may already provide it

os.makedirs('./adaptivemd', exist_ok=True)
os.makedirs('./adaptivegoal', exist_ok=True)
shutil.copytree('./generators', './adaptivemd/generators')
shutil.copytree('./generators', './adaptivegoal/generators')

'./adaptivegoal/generators'
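
Note that `shutil.copytree` raises `FileExistsError` by default when the destination already exists, so re-running the cell above fails. A minimal sketch of a re-runnable variant, assuming Python 3.8 or later for the `dirs_exist_ok` argument (`copy_generators` is a hypothetical helper name):

```python
import os
import shutil

def copy_generators(src, workdirs):
    """Copy `src` into `<workdir>/generators` for each workdir; safe to call twice."""
    copied = []
    for workdir in workdirs:
        os.makedirs(workdir, exist_ok=True)
        dest = os.path.join(workdir, "generators")
        # dirs_exist_ok (Python 3.8+) allows copying into an existing tree
        shutil.copytree(src, dest, dirs_exist_ok=True)
        copied.append(dest)
    return copied
```

For this tutorial the call would be `copy_generators('./generators', ['./adaptivemd', './adaptivegoal'])`.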

## AdaptiveMD

Let's change to the `adaptivemd` directory and work there:

In [5]:
os.chdir('./adaptivemd')

* Set up the queue that will be used for simulations.
* Tell it to store completed trajectories in the `data` folder, since this is where `AdaptiveMD` expects them by default

In [6]:
queue = LocalGPUQueue()
queue.datadir = './data'

In [7]:
ad = AdaptiveMD()
ad.app = queue

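* Set `nmin` and `nmax` (the minimum and maximum number of simultaneously running simulations) and `nepochs` (the number of epochs after which the run stops)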

In [8]:
ad.nmin = 1
ad.nmax = 3
ad.nepochs = 3
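
As a rough illustration of how these settings could interact, the function below assumes that `nmin` and `nmax` bound the number of simultaneously running simulations: nothing new is started while more than `nmin` are running, and otherwise the pool is topped up to `nmax`. This is an assumption made for the sketch, not a statement of HTMD's actual scheduling code, and `simulations_to_spawn` is a hypothetical helper.

```python
def simulations_to_spawn(running, nmin, nmax):
    """Number of new simulations to start under the assumption stated above."""
    if running > nmin:
        return 0                    # enough simulations are still running
    return max(nmax - running, 0)   # top up the pool to nmax

# With the values used in this tutorial (nmin=1, nmax=3)
for running in range(4):
    print(running, simulations_to_spawn(running, nmin=1, nmax=3))
```

Under this assumption, an empty queue leads to three new simulations, one running simulation leads to two more, and two or more running simulations lead to none.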

* Choose what projection to use for the construction of the Markov model

In [9]:
protsel = 'protein and name CA'
ad.projection = MetricSelfDistance(protsel)
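
Conceptually, a self-distance projection turns each frame into the list of distances between every pair of selected atoms, so `n` atoms give `n*(n-1)/2` features. The sketch below shows the idea on plain coordinates with the standard library only (Python 3.8+ for `math.dist`); it ignores periodic boundary conditions, and the ordering of pairs in HTMD's output may differ.

```python
from itertools import combinations
from math import dist

def self_distances(coords):
    """All unique pairwise distances between the given 3D points."""
    return [dist(a, b) for a, b in combinations(coords, 2)]

# Three hypothetical CA positions (in Angstrom) forming a 3-4-5 right triangle
frame = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (0.0, 4.0, 0.0)]
features = self_distances(frame)
print(features)  # [3.0, 4.0, 5.0]
```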

* Set the `updateperiod` to define how often (in seconds) the adaptive run will poll for completed simulations and redo the analysis

In [10]:
ad.updateperiod = 120 # execute every 2 minutes
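
The polling pattern behind such an update period can be sketched as a simple loop that does some work, sleeps, and checks again. This is a generic illustration, not HTMD's implementation; `poll_until_done` and its arguments are hypothetical names, and the sleep function is injectable so that the example runs instantly.

```python
import time

def poll_until_done(is_done, on_update, updateperiod, sleep=time.sleep):
    """Call `on_update` every `updateperiod` seconds until `is_done()` is true."""
    cycles = 0
    while not is_done():
        on_update()          # e.g. retrieve finished simulations and re-analyse
        cycles += 1
        sleep(updateperiod)  # wait before polling again
    return cycles

# Demo with a fake sleep so the example finishes immediately
completed = []
n = poll_until_done(lambda: len(completed) >= 3,
                    lambda: completed.append("epoch"),
                    updateperiod=120, sleep=lambda seconds: None)
print(n)  # 3
```

A shorter period reacts faster to finished simulations at the cost of more frequent analysis, so the value is a trade-off that depends on how long each simulation takes.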

Launch the `AdaptiveMD` run:

In [11]:
ad.run()

2024-06-11 16:32:58,958 - htmd.adaptive.adaptive - INFO - Processing epoch 0
2024-06-11 16:32:58,960 - htmd.adaptive.adaptive - INFO - Epoch 0, generating first batch
2024-06-11 16:32:58,983 - jobqueues.util - INFO - Trying to determine all GPU devices
2024-06-11 16:32:59,035 - jobqueues.localqueue - INFO - Using GPU devices 0
2024-06-11 16:32:59,037 - jobqueues.util - INFO - Trying to determine all GPU devices
2024-06-11 16:32:59,093 - jobqueues.localqueue - INFO - Queueing /home/sdoerr/Work/htmd/tutorials/adaptivemd/input/e1s1_ntl9_1ns_0
2024-06-11 16:32:59,095 - jobqueues.localqueue - INFO - Queueing /home/sdoerr/Work/htmd/tutorials/adaptivemd/input/e1s2_ntl9_1ns_1
2024-06-11 16:32:59,097 - jobqueues.localqueue - INFO - Running /home/sdoerr/Work/htmd/tutorials/adaptivemd/input/e1s1_ntl9_1ns_0 on device 0
2024-06-11 16:32:59,101 - jobqueues.localqueue - INFO - Queueing /home/sdoerr/Work/htmd/tutorials/adaptivemd/input/e1s3_ntl9_1ns_2
2024-06-11 16:32:59,105 - htmd.adaptive.adaptive -

## AdaptiveGoal

Now let's change to the `adaptivegoal` directory and work there instead:

In [12]:
os.chdir('../adaptivegoal')

* Most of the class arguments are identical to AdaptiveMD

In [13]:
adg = AdaptiveGoal()
adg.app = queue
adg.nmin = 1
adg.nmax = 3
adg.nepochs = 2
adg.generatorspath = './generators'
adg.projection = MetricSelfDistance('protein and name CA')
adg.updateperiod = 120  # execute every 2 minutes
adg.goalfunction = None  # set to None just as an example

* It additionally requires the `goalfunction` argument, which defines the goal
* We can define a variety of different goal functions

## The goal function

The goal function:
* takes as input a `Molecule` object of a simulation, and
* produces as output a score for each frame of that simulation.

The higher the score, the more desirable that simulation frame is for respawning.
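
To make the "higher score is better" convention concrete, the sketch below picks the best-scoring frames across several simulations. It is a deliberately simplified illustration: it uses only the goal score and ignores any exploration term, and the simulation names, score values, and `best_frames` helper are hypothetical.

```python
import heapq

def best_frames(scores_per_sim, n):
    """Return the n (simulation, frame) pairs with the highest goal score."""
    flat = [(score, sim, frame)
            for sim, scores in scores_per_sim.items()
            for frame, score in enumerate(scores)]
    return [(sim, frame) for score, sim, frame in heapq.nlargest(n, flat)]

# One list of per-frame scores for each simulation
scores = {"e1s1": [-8.0, -6.5, -7.2], "e1s2": [-9.1, -5.9, -6.0]}
print(best_frames(scores, 2))  # [('e1s2', 1), ('e1s2', 2)]
```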

### RMSD goal function

For this goal function, we will use a crystal structure of NTL9.

You can download the structure from the following link and save it in the `adaptivegoal` directory:

* [NTL9 crystal structure](http://pub.htmd.org/tutorials/adaptive-sampling/ntl9_crystal.pdb).

Alternatively, you can download the structure using `wget`.

In [14]:
assert os.system('wget -q http://pub.htmd.org/tutorials/adaptive-sampling/ntl9_crystal.pdb') == 0

We can define a simple goal function that uses the RMSD between the conformation sampled and a reference (in this case, the crystal structure), and returns a score to be evaluated by the `AdaptiveGoal` algorithm:

In [15]:
ref = Molecule('./ntl9_crystal.pdb')

def mygoalfunction(mol):
    rmsd = MetricRmsd(ref, 'protein and name CA').project(mol)
    return -rmsd  # or even 1/rmsd

adg.goalfunction = mygoalfunction

`AdaptiveGoal` ranks conformations from high to low score. Since we want a lower RMSD to give a higher score, the function returns the negative of the RMSD (the reciprocal, `1/rmsd`, would also work).
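
A quick check with made-up numbers shows that negating the RMSD and taking its reciprocal produce the same ordering of frames. The RMSD values below are hypothetical and `ranking` is a helper defined only for this illustration.

```python
rmsd = [4.2, 1.3, 7.8, 2.5]   # hypothetical per-frame RMSD values in Angstrom

def ranking(scores):
    """Frame indices ordered from highest to lowest score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

negated = [-r for r in rmsd]
reciprocal = [1.0 / r for r in rmsd]   # undefined if any RMSD is exactly zero

print(ranking(negated))      # [1, 3, 0, 2]
print(ranking(reciprocal))   # [1, 3, 0, 2]
```

The two transformations agree on the order but not on the scale: the reciprocal grows very quickly as the RMSD approaches zero, which may matter if the score is combined with other terms.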

Launch the `AdaptiveGoal` run:

In [16]:
adg.run()

2024-06-11 16:49:07,264 - htmd.adaptive.adaptive - INFO - Processing epoch 0
2024-06-11 16:49:07,265 - htmd.adaptive.adaptive - INFO - Epoch 0, generating first batch
2024-06-11 16:49:07,286 - jobqueues.localqueue - INFO - Queueing /home/sdoerr/Work/htmd/tutorials/adaptivegoal/input/e1s1_ntl9_1ns_0
2024-06-11 16:49:07,287 - jobqueues.localqueue - INFO - Queueing /home/sdoerr/Work/htmd/tutorials/adaptivegoal/input/e1s2_ntl9_1ns_1
2024-06-11 16:49:07,287 - jobqueues.localqueue - INFO - Running /home/sdoerr/Work/htmd/tutorials/adaptivegoal/input/e1s1_ntl9_1ns_0 on device 0
2024-06-11 16:49:07,288 - jobqueues.localqueue - INFO - Queueing /home/sdoerr/Work/htmd/tutorials/adaptivegoal/input/e1s3_ntl9_1ns_2
2024-06-11 16:49:07,289 - htmd.adaptive.adaptive - INFO - Sleeping for 120 seconds.
2024-06-11 16:51:07,308 - htmd.adaptive.adaptive - INFO - Processing epoch 1
2024-06-11 16:51:07,309 - htmd.adaptive.adaptive - INFO - Retrieving simulations.
2024-06-11 16:51:07,309 - htmd.adaptive.adaptiv

### Functions with multiple arguments

The goal function can also take multiple arguments. This allows flexibility and on-the-fly comparisons to non-static conformations (i.e. compare with different references as the run progresses). Here, we redefine the previous goal function with multiple arguments:

In [17]:
def newgoalfunction(mol, crystal):
    rmsd = MetricRmsd(crystal, 'protein and name CA').project(mol)
    return -rmsd  # or even 1/rmsd

Now we clean the previous `AdaptiveGoal` run, and start a new one with the new goal function:

In [18]:
# clean previous run
shutil.rmtree('./input')
shutil.rmtree('./data')
shutil.rmtree('./filtered')

# run with new goal
ref = Molecule('./ntl9_crystal.pdb')
adg.goalfunction = (newgoalfunction, (ref,))
adg.run()

2024-06-11 17:01:09,923 - htmd.adaptive.adaptive - INFO - Processing epoch 0
2024-06-11 17:01:09,924 - htmd.adaptive.adaptive - INFO - Epoch 0, generating first batch
2024-06-11 17:01:09,935 - jobqueues.localqueue - INFO - Queueing /home/sdoerr/Work/htmd/tutorials/adaptivegoal/input/e1s1_ntl9_1ns_0
2024-06-11 17:01:09,935 - jobqueues.localqueue - INFO - Queueing /home/sdoerr/Work/htmd/tutorials/adaptivegoal/input/e1s2_ntl9_1ns_1
2024-06-11 17:01:09,935 - jobqueues.localqueue - INFO - Running /home/sdoerr/Work/htmd/tutorials/adaptivegoal/input/e1s1_ntl9_1ns_0 on device 0
2024-06-11 17:01:09,936 - jobqueues.localqueue - INFO - Queueing /home/sdoerr/Work/htmd/tutorials/adaptivegoal/input/e1s3_ntl9_1ns_2
2024-06-11 17:01:09,936 - htmd.adaptive.adaptive - INFO - Sleeping for 120 seconds.
2024-06-11 17:03:10,040 - htmd.adaptive.adaptive - INFO - Processing epoch 1
2024-06-11 17:03:10,049 - htmd.adaptive.adaptive - INFO - Retrieving simulations.
2024-06-11 17:03:10,050 - htmd.adaptive.adaptiv
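
The `(function, (args,))` form used above can be understood through the following sketch of how such a tuple might be unpacked and called. This is an assumption about the calling convention made for illustration, not HTMD's actual code; `evaluate_goal` and `distance_to` are hypothetical names.

```python
def evaluate_goal(goalfunction, mol):
    """Call a goal given either as a function or as a (function, args) tuple."""
    if isinstance(goalfunction, tuple):
        func, extra_args = goalfunction
        return func(mol, *extra_args)   # extra arguments follow the molecule
    return goalfunction(mol)

def distance_to(value, reference):      # stand-in for a real goal function
    return -abs(value - reference)

print(evaluate_goal((distance_to, (10,)), 7))   # -3
print(evaluate_goal(lambda v: -abs(v), 7))      # -7
```

The trailing comma in `(ref,)` matters: in Python, `(ref)` is just `ref` in parentheses, while `(ref,)` is a one-element tuple.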

### Other goal function examples

HTMD includes two other goal functions: the secondary structure goal function and the contacts goal function.

#### Secondary structure goal function

In [19]:
ref = Molecule('./ntl9_crystal.pdb')

def ssGoal(mol, crystal):
    crystalSS = MetricSecondaryStructure().project(crystal)[0]
    proj = MetricSecondaryStructure().project(mol)
    # How many crystal SS match with simulation SS
    ss_score = np.sum(proj == crystalSS, axis=1) / proj.shape[1]  
    return ss_score

adg.goalfunction = (ssGoal, (ref,))
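
The scoring line in `ssGoal` relies on NumPy broadcasting: the per-residue crystal assignment is compared against every frame at once, and the matches are averaged per frame. A toy example with made-up data illustrates this; the integer codes are hypothetical and do not necessarily match the encoding used by `MetricSecondaryStructure`.

```python
import numpy as np

# Hypothetical secondary-structure codes per residue (e.g. 0=coil, 1=helix, 2=strand)
crystal_ss = np.array([1, 1, 1, 0, 2, 2])
# Two simulation frames, one row per frame
proj = np.array([[1, 1, 0, 0, 2, 2],
                 [0, 0, 0, 0, 0, 2]])

# Broadcasting compares every frame (row) against the crystal assignment
ss_score = np.sum(proj == crystal_ss, axis=1) / proj.shape[1]
print(ss_score)  # fractions 5/6 and 2/6 for the two frames
```

The score is a fraction between 0 and 1, so a frame whose secondary structure fully matches the crystal gets the maximum score of 1.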

#### Contacts goal function

In [20]:
ref = Molecule('./ntl9_crystal.pdb')

def contactGoal(mol, crystal):
    # Boolean contact map of the crystal, flattened to a 1D mask over CA-CA pairs
    crystalCO = np.ravel(MetricSelfDistance('protein and name CA', pbc=False,
                                            metric='contacts',
                                            threshold=10).project(crystal))
    proj = MetricSelfDistance('protein and name CA',
                              metric='contacts',
                              threshold=10).project(mol)
    # Fraction of crystal contacts that are formed in each frame
    co_score = np.sum(proj[:, crystalCO] == 1, axis=1)
    co_score = co_score / np.sum(crystalCO)  # not `/=`: co_score is an integer array
    return co_score

adg.goalfunction = (contactGoal, (ref,))
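
The idea behind the contacts score can be checked on toy arrays: use the crystal contact map as a boolean mask over residue pairs, count how many of those pairs are in contact in each frame, and divide by the number of crystal contacts. The arrays below are made up for illustration.

```python
import numpy as np

# Hypothetical contact maps: one boolean per residue pair
crystal_co = np.array([True, False, True, True, False])
proj = np.array([[True, True, True, False, False],
                 [False, False, True, False, True]])

# Keep only the pairs that are in contact in the crystal, then count per frame
formed = np.sum(proj[:, crystal_co], axis=1)
co_score = formed / np.sum(crystal_co)   # true division gives a float fraction
print(co_score)  # fractions 2/3 and 1/3 for the two frames
```

Note that NumPy refuses an in-place `/=` on an integer array, because the floating-point result cannot be cast back to integers, so the division is written as a regular assignment.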

Many more goal functions can be devised.