# Introduction

Now that we have finished preparing our crystal system, we can finally start the production runs. First, let us clarify our goal: to use MD simulation with a sufficiently accurate protein force field to sample protein configurations near the experimentally observed structure. Then, by the ergodic hypothesis, snapshots taken along the trajectory represent the ensemble of unit cell structures in the actual crystal from which we collected the diffraction data. To be clear, the structure obtained from a diffraction experiment is a time average over the duration of X-ray illumination and also a spatial average over all unit cells in the crystal. Because computation is cheap compared with experimental setups, we hope that the time-averaged structure from a sufficiently long MD simulation of the crystal is comparable to the experimentally determined structure.

One caveat is that we are simulating just a 1x1x1 unit cell, so the periodic boundary condition (PBC) introduces artificial periodicity into the crystal system represented by our simulation. This may lead to subtle artefacts, such as relative shifts in the average positions of individual proteins within the unit cell and slight distortions of the average protein structures that cannot easily be eliminated by superposition (alignment). In my experience, these effects appear early and do not go away unless we simulate for a very long time, i.e., the system gets stuck in some metastable crystal conformation. The strategy here is therefore to run many replicates under the same simulation settings. We also note that systematic errors in the force field may make native conformations less stable than they actually are, which can shift the average positions of the molecules relative to the unit cell and to the crystal structure in ways that cannot be corrected by longer simulations or by averaging across replicates.

This notebook primarily explains the details of the production run and how our implementation allows the post-processing steps on MD trajectories to run in parallel with the production run, saving the time and effort of monitoring the tasks and alternating between the scripts manually.

# Step 1. Performing squeeze run replicates

Empirically, we found that using a single post-squeeze crystal system for production runs does not yield good enough statistics from the electron density map calculations later. This is a convergence issue and is likely due to the artificial PBC discussed above. We noted that simulating longer trajectories (several hundred ns) does not help much, but we can approach convergence much faster using squeeze run replicates. 

This means we will start from the same `neutralized.pdb` structure from the previous step, make copies of this system, and, for each copy, independently and randomly add waters and squeeze until its volume stabilizes. Although the volumes will vary slightly between replicates, this method should allow the crystal systems to relax into different local-minimum conformations as the restraints are tapered off during the squeeze step, and thus help with sampling.

Here's the simple shell script to make a folder for holding each simulation replicate:

In [1]:
with open('./make_replicate.sh', 'r') as f:
    print(f.read())

mkdir $1 
cd $1
ln -s ../../CRO_parametrization/cro.xml .
ln -s ../../Crystal_system_construction/neutralized.pdb .
ln -s ../../Crystal_system_construction/squeeze_run.py .
ln -s ../../Crystal_system_construction/squeeze_run.sh
sbatch squeeze_run.sh


Then we can execute the script with the first argument ranging from 0 to 4 to make 5 replicates. For the squeeze step, each simulation should take no more than 3 hours. The trajectory files should be about 2.5 GiB.
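As a convenience, the five invocations can be generated programmatically. The following is a minimal sketch, assuming `make_replicate.sh` sits in the current directory and takes the replicate index as its only argument (as in the listing above); the helper names `replicate_commands` and `launch_replicates` are hypothetical and not part of the original workflow. By default it only prints the commands instead of submitting jobs.

```python
import subprocess


def replicate_commands(n_replicates=5, script="./make_replicate.sh"):
    """Build one `bash make_replicate.sh <index>` command per replicate."""
    return [["bash", script, str(i)] for i in range(n_replicates)]


def launch_replicates(commands, dry_run=True):
    """Run the commands in order; with dry_run=True, only print them."""
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            # Each call runs in its own shell, so the `cd $1` inside the
            # script does not change the working directory of this process.
            subprocess.run(cmd, check=True)


commands = replicate_commands()      # replicate indices 0..4
launch_replicates(commands, dry_run=True)
```

Setting `dry_run=False` would actually execute the script once per replicate, and therefore submit one squeeze job per folder.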

# Step 2. Parallel production runs & post-processing

## 2.1. Production run
We have collected diffraction data for an eGFP crystal pumped by a terahertz (THz) waveform from a laser. This pulsed beam induces a strong, transient electric field in our protein crystal of interest. To simulate experimental conditions in which the single-cycle THz pulse has a nontrivial waveform (see the THz draft manuscript), we assume that the pulse can be approximated by three phases: 1 ps of constant field in the positive x-direction (crystallographic a-axis), 1 ps in the negative x-direction, and 8 ps with no field. In the Stark effect study we used a 300 kV/cm field, but here we explore 1-10 MV/cm fields, which give stronger vibrational responses and provide interesting predictions that could be tested with more intense laser sources in the future. To apply the electric field in our MD simulation, we attach a custom force to every atom, proportional to both its partial charge and the field strength. Treating the field as spatially uniform appears justified: light travels about 0.3 mm during the shortest (1 ps) field phase, which is on the same length scale as the longest dimension of the mounted crystal.
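To make the numbers in this paragraph concrete, here is a small, dependency-free sketch of the unit conversion and of the three-phase waveform. It is an illustration under stated assumptions, not the code used in the production script: the function names are hypothetical, the conversion is done by hand rather than through OpenMM's unit system, and the waveform is taken to repeat every 10 ps exactly as described above.

```python
# Exact SI defining constants (2019 redefinition of the SI)
ELEMENTARY_CHARGE_C = 1.602176634e-19     # coulomb
AVOGADRO_PER_MOL = 6.02214076e23          # 1/mol
SPEED_OF_LIGHT_M_PER_S = 299_792_458.0    # m/s

PULSE_PERIOD_PS = 10.0  # 1 ps positive + 1 ps negative + 8 ps field-free


def mv_per_cm_to_kj_per_mol_nm_e(field_mv_per_cm):
    """Convert a field in MV/cm to kJ/(mol * nm * elementary charge)."""
    volts_per_nm = field_mv_per_cm * 1e6 / 1e7        # 1 cm = 1e7 nm
    joule_per_nm_per_e = volts_per_nm * ELEMENTARY_CHARGE_C
    return joule_per_nm_per_e * AVOGADRO_PER_MOL / 1000.0


def field_at_time(t_ps, amplitude):
    """x-component of the idealised pulse train at time t_ps."""
    phase = t_ps % PULSE_PERIOD_PS
    if phase < 1.0:
        return amplitude       # positive-field phase
    if phase < 2.0:
        return -amplitude      # negative-field phase
    return 0.0                 # field-free phase


def light_travel_mm(duration_ps):
    """Distance light travels in vacuum during duration_ps, in mm."""
    return SPEED_OF_LIGHT_M_PER_S * duration_ps * 1e-12 * 1e3


print(mv_per_cm_to_kj_per_mol_nm_e(10.0))  # roughly 96.5 for a 10 MV/cm field
print(light_travel_mm(1.0))                # roughly 0.3 mm in 1 ps
```

With a field `E` in these units, the force on an atom with partial charge `q` (in units of the elementary charge) along x is simply `q * E`, corresponding to a potential energy term `-q * E * x`.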

Empirically, we have verified that the relaxation time is shorter than 8 ps, so we can apply pulses back to back in the MD simulation to obtain the trajectories for analysis. We still record the trajectory at 0.1 ps intervals, but since the system configurations within each phase of a pulse are highly correlated, we keep only one sample per phase during post-processing.
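The subsampling rule above can be sketched as a simple index calculation. This is an illustrative sketch only: which frame represents each phase is not specified here, so the code assumes (hypothetically) that frame `k` is written at time `(k + 1) * 0.1` ps and that the last frame of each phase is the one kept. The function names are also hypothetical.

```python
FRAME_INTERVAL_PS = 0.1
# (phase name, duration in ps), in the order they occur within one pulse
PHASES = (("pos", 1.0), ("neg", 1.0), ("zero", 8.0))


def frames_per_pulse():
    """Number of recorded frames in one 10 ps pulse."""
    total_ps = sum(duration for _, duration in PHASES)
    return round(total_ps / FRAME_INTERVAL_PS)


def phase_sample_indices(n_pulses):
    """Frame indices to keep: one frame per phase per pulse.

    Assumption: the representative frame is the last frame of each phase.
    """
    per_pulse = frames_per_pulse()
    indices = {name: [] for name, _ in PHASES}
    for pulse in range(n_pulses):
        offset = pulse * per_pulse
        elapsed_ps = 0.0
        for name, duration in PHASES:
            elapsed_ps += duration
            last_frame = round(elapsed_ps / FRAME_INTERVAL_PS) - 1
            indices[name].append(offset + last_frame)
    return indices


idx = phase_sample_indices(2)
print(frames_per_pulse(), idx)
```

Under these assumptions, a 50 ns run of 5,000 pulses records 500,000 frames, of which 15,000 (three per pulse) survive subsampling.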

The following code (courtesy of Jack's initial code snippet) demonstrates how to perform a simulation consisting of 5,000 pulses (50ns) using the parameters described above: 

In [2]:
with open('./terahertz_pulse_run.py', 'r') as f:
    print(f.read())

# import stuff
from openmm.app import PDBFile, ForceField
from openff.toolkit.topology import Molecule
from openmmforcefields.generators import GAFFTemplateGenerator
from simtk.unit import *
import mdtraj
import mdtools
import pickle
import argparse
import os

def getFieldStrength(e):
    """
    Convert from given unit of electric field strength to a value
    in OpenMM's standard unit
    """
    def convert(v):
        return (v*AVOGADRO_CONSTANT_NA).value_in_unit(kilojoule_per_mole / elementary_charge / nanometer)

    if isinstance(e, list):
        return [convert(e_c) for e_c in e]
    else:
        return convert(e)

    
parser = argparse.ArgumentParser()
parser.add_argument("-E", "--E", type=str, help="Field strength of the THz pulse (MV/cm)", default=10e8)
parser.add_argument("-t0", "--t0", type=int, help="Initial duration of NVT equilibration (ns)", default=10)
parser.add_argument("-t1", "--t1", type=int, help="Number of pulse cycles for the production run")
parser.add_argu

## 2.2. Post-processing and parallel processing

Given the raw trajectories, many post-processing steps are needed to arrive at the computed electron density map. First, we remove the solvent and shift the protein chains by unit cell dimensions if their centers of mass have drifted across the periodic boundary. Next, we align the structures chain by chain (each chain is an asymmetric unit in the crystal), which is equivalent to placing the proteins on symmetry-related positions and thus eliminates some of the lattice distortion that we are not interested in. Then we split the trajectory into positive-, negative-, and zero-field phases and subsample over cycles of field oscillation. We do this for each of the four chains in our 1x1x1 eGFP crystal system. Finally, we compute the structure factors from all these snapshots using Phenix and average along the time axis, giving the average structure factor for each chain and each phase.
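The final averaging step can be illustrated with a toy example. The sketch below assumes the per-snapshot structure factors have already been computed (by Phenix in our workflow) and loaded as dictionaries mapping Miller indices to complex values; this data layout and the function name are hypothetical. It also assumes that the average is taken over the complex structure factors rather than over amplitudes alone.

```python
def average_structure_factors(snapshots):
    """Average complex structure factors over snapshots.

    `snapshots` is a list of dicts {(h, k, l): complex F}. Only reflections
    present in every snapshot are averaged.
    """
    if not snapshots:
        raise ValueError("need at least one snapshot")
    common = set(snapshots[0])
    for snap in snapshots[1:]:
        common &= set(snap)
    n = len(snapshots)
    return {hkl: sum(snap[hkl] for snap in snapshots) / n
            for hkl in sorted(common)}


# Two toy snapshots: (1, 0, 0) keeps its amplitude but changes phase,
# while (0, 1, 0) is identical in both.
snapshots = [
    {(1, 0, 0): 2 + 0j, (0, 1, 0): 1j},
    {(1, 0, 0): 0 + 2j, (0, 1, 0): 1j},
]
avg = average_structure_factors(snapshots)
print(avg)
```

Note that the averaged amplitude of `(1, 0, 0)` is smaller than the amplitude in either snapshot, because the phases differ between snapshots. This is the sense in which the average reflects disorder across the ensemble.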

The Python script that performs these steps in parallel with the production run is shown below. The listing is followed by an explanation of why running the two steps in parallel is useful.

In [3]:
with open('./thz_continuous_postproc.py', 'r') as f:
    print(f.read())

from simtk.unit import *
from tqdm import tqdm
import mdtraj
import mdtools
from mdtools.utils import *
import argparse
import subprocess
import os
import multiprocessing as mp
from itertools import product

fifo_name = 'fifo_pipe'
cpu_count = mp.cpu_count()
print(f'Running on {cpu_count} cpus')

# parameters
parser = argparse.ArgumentParser()
parser.add_argument("-E", "--E", type=str, help="Field strength of the THz pulse (MV/cm)", default=10e8)
parser.add_argument("-t0", "--t0", type=int, help="Initial duration of NVT equilibration (ns)", default=10)
parser.add_argument("-t1", "--t1", type=int, help="Number of pulse cycles for the production run")
parser.add_argument("-t2", "--n_pulses", type=int, help="Number of pulses per cycle", default=100)
parser.add_argument("-i", "--input", type=str, help="Input file for the crystal system", default="squeezed.pdb")
parser.add_argument("-o", "--output", type=str, help="Prefix for the output trajectory and state files")
parser.add_argument("-n",

The post-processing step depends on data from the production run. The idea is simple: once the production run has produced some trajectory, we can post-process it immediately while the production run continues to generate more data. We can thus run the two steps in parallel using two separate scripts. In practice, however, the two steps proceed at different rates, so we need to coordinate the two processes: post-processing starts once the production run reaches a milestone (say, 5,000 cycles), and the production run then pauses at the next milestone until post-processing of the previous data has finished. We use the simplest implementation, a blocking FIFO file object that acts as a pipe between the two Python processes. The post-processing script is shown above, and the production run script is rewritten accordingly:

In [4]:
with open('./thz_continuous_simulation.py', 'r') as f:
    print(f.read())

from openmm.app import PDBFile, ForceField
from openff.toolkit.topology import Molecule
from openmmforcefields.generators import GAFFTemplateGenerator
from simtk.unit import *
from tqdm import tqdm
import mdtraj
from mdtraj.reporters import HDF5Reporter
import mdtools
from mdtools.utils import *
import pickle
import argparse
import subprocess
import os

#
fifo_name = 'fifo_pipe'

# parameters
parser = argparse.ArgumentParser()
parser.add_argument("-E", "--E", type=str, help="Field strength of the THz pulse (MV/cm)", default=10e8)
parser.add_argument("-t0", "--t0", type=int, help="Initial duration of NVT equilibration (ns)", default=10)
parser.add_argument("-t1", "--t1", type=int, help="Number of pulse cycles for the production run")
parser.add_argument("-t2", "--n_pulses", type=int, help="Number of pulses per cycle", default=100)
parser.add_argument("-i", "--input", type=str, help="Input file for the crystal system", default="squeezed.pdb")
parser.add_argument("-o", "--output", type=st

All that remains is to fire up these two scripts with proper arguments for each of the five simulation replicates. 
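For readers unfamiliar with named pipes, the milestone handshake described above can be sketched in a few lines. This is a self-contained toy, not the protocol used by the two scripts: it runs both roles as threads in one process, creates its pipes in a temporary directory, and uses a second pipe for acknowledgements (an assumption made here for robustness; the scripts above share a single `fifo_pipe`). It requires a POSIX system, since it relies on `os.mkfifo`.

```python
import os
import tempfile
import threading


def run_demo(n_chunks=3):
    """Producer announces finished chunks; consumer processes and acknowledges."""
    processed = []
    with tempfile.TemporaryDirectory() as tmp:
        ready_path = os.path.join(tmp, "ready_pipe")  # producer -> consumer
        ack_path = os.path.join(tmp, "ack_pipe")      # consumer -> producer
        os.mkfifo(ready_path)
        os.mkfifo(ack_path)

        def producer():
            # Opening a FIFO blocks until the other end is opened as well.
            with open(ready_path, "w") as ready, open(ack_path, "r") as ack:
                for i in range(n_chunks):
                    # ... simulate chunk i here ...
                    if i > 0:
                        ack.readline()   # pause until chunk i-1 is processed
                    ready.write(f"{i}\n")
                    ready.flush()
                ack.readline()           # wait for the last chunk as well

        def consumer():
            with open(ready_path, "r") as ready, open(ack_path, "w") as ack:
                for line in ready:       # blocks until the next milestone
                    chunk = int(line)
                    processed.append(chunk)  # stand-in for post-processing
                    ack.write(f"{chunk}\n")
                    ack.flush()

        threads = [threading.Thread(target=producer),
                   threading.Thread(target=consumer)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
    return processed


print(run_demo(3))
```

In this scheme the simulation of chunk `i` overlaps with the post-processing of chunk `i - 1`, and neither side can run more than one chunk ahead of the other.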

# Step 3. Putting everything together

The following shell script launches two processes inside one Slurm job. The two processes share 1 GPU and 32 CPU cores, use up to 128 GiB of memory (mostly for post-processing), and run for up to 1 day to simulate and process 100 ns of trajectory (empirically, I've achieved up to 180 ns/day with the current setup, but this varies with the specifics of the nodes used):

In [5]:
with open('./thz_continuous.sh', 'r') as f:
    print(f.read())

#!/bin/bash
#SBATCH --gres=gpu:1
#SBATCH -t 1-00:00          # Runtime in D-HH:MM, minimum of 10 minutes
#SBATCH -p gpu
#SBATCH -c 32
#SBATCH --mem=128G           # Memory pool for all cores (see also --mem-per-cpu)
#SBATCH -o thz_cont_gpu_%A_%a.out
#SBATCH -e thz_cont_gpu_%A_%a.err

source /n/home11/ziz531/.bashrc
source  $LOGIN/.bash_addons
conda activate openmm
# export so that the variable is visible to the Python processes below
export OPENMM_DEFAULT_PLATFORM=CUDA
python /n/holyscratch01/hekstra_lab/ziyuan/EF-X-crystal-MD/Production_runs/thz_continuous_simulation.py -n 4 -o EF_10MV_cm -t0 0 -t1 200 -t2 100 -r $1&
python /n/holyscratch01/hekstra_lab/ziyuan/EF-X-crystal-MD/Production_runs/thz_continuous_postproc.py -n 4 -o EF_10MV_cm -t0 0 -t1 200 -t2 100 -r $1




For each replicate we will first create the fifo pipe and then call the above script for its corresponding folder:

In [6]:
with open('./replicate_production_run.sh', 'r') as f:
    print(f.read())

cd $1
mkfifo fifo_pipe
mkdir data
ln -s ../asu_ref.h5 .
ln -s ../atoms_for_alignment.npy .
sbatch ../thz_continuous.sh $1



Before running the script, we need to prepare two auxiliary files for the chain-wise alignment. The first file, `asu_ref.h5`, holds the structure of the protein asymmetric unit (ASU). For simplicity, we just use the post-neutralization structure with all waters stripped; the one-line code for this is shown below. (Note: you must run it before calling `replicate_production_run.sh`, which creates links to this file from the replicate folders.) The second file, `atoms_for_alignment.npy`, stores the indices of all Cα atoms that remain relatively fixed throughout the simulation. Please refer to the analysis notebook for how to obtain it, or use the existing file in the current folder if you do not want to jump ahead.

In [1]:
import mdtraj 

mdtraj.load('../Crystal_system_construction/neutralized.pdb')[0].remove_solvent().save('asu_ref.h5')

