# Protein Mutation Tutorial for $\lambda$ Dynamics Workshop

## I. Introduction

### I.A. Free Energy Cycle

Today we will be evaluating relative protein folding free energies (i.e. relative stabilities) by considering the free energy of mutation in the folded ensemble and in the unfolded ensemble.

Other free energies, such as relative protein-protein binding free energies, can be obtained with these same techniques by evaluating the mutations in different ensembles.

<img src="Figures/CycleBoth.jpg" alt="cycle" width=400/>
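
Written out, the cycle gives the relation below. The labels are notation introduced here for illustration and are not taken from the figure: F and U denote the folded and unfolded ensembles, and wt and mut denote the two sequences.

$$
\Delta\Delta G_{\mathrm{fold}}
= \Delta G_{\mathrm{fold}}^{\mathrm{mut}} - \Delta G_{\mathrm{fold}}^{\mathrm{wt}}
= \Delta G_{\mathrm{F}}^{\mathrm{wt}\to\mathrm{mut}} - \Delta G_{\mathrm{U}}^{\mathrm{wt}\to\mathrm{mut}}
$$

The two terms on the right are the alchemical free energies of mutation in each ensemble, which is why the analysis at the end of this tutorial subtracts the unfolded result from the folded result.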

### I.B. Approximations

The unfolded ensemble is structurally diverse and relaxes slowly. To avoid sampling this difficult ensemble, we note that amino acids are extended and solvent exposed in the unfolded ensemble, and approximate the unfolded ensemble with a short pentapeptide. The mutating amino acid is flanked by two non-mutating residues on each side.

While the termini of a full protein are typically charged under normal pH conditions, the ends of this peptide are artificially truncated, so we cap them with neutral groups (ACE and CT3 in CHARMM) to avoid perturbing local electrostatics.

When multiple mutation sites are considered, we approximate their effects as additive in the unfolded ensemble, unless the sites are adjacent or next-to-adjacent in sequence. Such nearby residues are modeled together in the same peptide, again with two non-mutating residues on each side.
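
The grouping rule above can be sketched in a few lines of Python. This is a minimal illustration, not part of the tutorial scripts: the function name `peptide_windows` and its parameters are hypothetical, and clipping at the chain ends is an assumption about how sites near a real terminus would be handled.

```python
def peptide_windows(sites, first, last, flank=2, max_gap=2):
    """Group mutation sites into peptides and return (start, end) resid ranges.

    Sites whose resids differ by max_gap or less (adjacent or next-to-adjacent)
    share one peptide. Each peptide keeps `flank` non-mutating residues on each
    side, clipped to the chain termini `first` and `last` (an assumption).
    """
    groups = []
    for resid in sorted(sites):
        if groups and resid - groups[-1][-1] <= max_gap:
            groups[-1].append(resid)  # close in sequence: same peptide
        else:
            groups.append([resid])  # far enough apart: start a new peptide
    return [(max(first, g[0] - flank), min(last, g[-1] + flank)) for g in groups]


# The three T4 lysozyme sites used later in this tutorial are far apart,
# so each one gets its own pentapeptide.
windows = peptide_windows([17, 27, 33], first=1, last=162)
print(windows)
```

For sites 17, 27, and 33 this yields three separate five-residue windows, which is why three peptides are built on CHARMM-GUI later on.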

### I.C. Patches

For small molecule perturbations, one must generate new patches for each new system because the ligands and perturbations of interest are always unique. Protein mutations are more convenient in this regard: the same mutations recur across proteins, so the set of patches may be generated once and applied to any system.

#### I.C.1. Side Chain Patches

The first study of amino acid mutations with $\lambda$ dynamics (in T4 lysozyme) placed the protein backbone in the environment (with a single C$\alpha$ atom) and used patches for the side chains (which gives many C$\beta$ atoms). An immediately obvious limitation is that glycine and proline mutations cannot be treated with this approach because backbone parameters and connectivity change.

This approach works because the alchemical regions are connected to the environment in just one place, via the C$\alpha$-C$\beta$ bond, and this allows the contribution of the unscaled bonded interactions to be factored out of the partition function by a change of variables when $\lambda=0$. (There is a caveat here about angles.) As the free energy is proportional to the log of the partition function, their contribution to $\Delta G$ is the same in both ensembles and cancels out.
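
A schematic version of this argument, ignoring the angle caveat, is sketched below. The symbols are introduced here for illustration only: $x$ denotes the coordinates of the environment and the active substituent, and $y$ denotes the internal coordinates of a decoupled substituent relative to its C$\alpha$ anchor.

$$
Z(\lambda=0) = \int dx\,dy\; e^{-\beta\left[U_{\mathrm{env}}(x) + U_{\mathrm{bonded}}(y)\right]} = Z_{\mathrm{env}}\,Z_{\mathrm{bonded}}
$$

$$
G = -k_B T \ln Z = -k_B T \ln Z_{\mathrm{env}} - k_B T \ln Z_{\mathrm{bonded}}
$$

Because $Z_{\mathrm{bonded}}$ depends only on the decoupled substituent and not on its surroundings, the last term is identical in the folded and unfolded ensembles and drops out of $\Delta\Delta G$.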

T4 Lysozyme Citation: DOI: <a href="https://doi.org/10.1002/pro.3500">10.1002/pro.3500</a>

#### I.C.2. Whole Residue Patches for Proline and Glycine

As noted above, backbone parameters and connectivity change for glycine and especially for proline mutations, so they cannot be treated with the side chain perturbation scheme. An obvious alternative is to scale the entire residue by $\lambda$. However, the alchemical group is then bonded to the rest of the system in two places, and unscaled bonds can no longer be factored out of the partition function when $\lambda=0$. Consider an example of the problem this causes: in mutating from proline to glycine, when proline is off, its ring topology still restricts rotation around the backbone $\phi$ angle, so glycine loses entropy it ought to have in the unfolded ensemble. The entropic cost of folding glycine is then artificially reduced, and the $\Delta\Delta G$ favors glycine more than it should.

To scale the entire residue by $\lambda$, two additional techniques were developed. The first was a soft bond scaled by $\lambda$ in the proline side chain. This allows free rotation around the backbone $\phi$ angle when $\lambda=0$, and while the whole residue cannot be factored out of the partition function, at least the side chains and HN atoms can be.

The second development was to scale all the remaining backbone bonds which could not be factored out of the partition function by $\lambda$ so they do not contribute to the partition function at $\lambda=0$, and harmonically restrain these gas phase atoms to an analogous atom on another substituent so they do not drift off. These harmonic restraints can also be factored out of the partition function.

These two developments led to two additional sections in the BLOCK module. The CATS section (Constrained ATom Scaling) lists groups of atoms that are restrained together and whose bonded interactions are scaled by $\lambda$. There must be one atom from each substituent in each CATS selection. The SOBO section (SOft BOnds) lists pairs of bonded atoms whose bond term (and associated angle terms) should be scaled by $\lambda$.

An additional complication of this approach is that adjacent mutations must have bonded terms added between mutated forms of the residues, necessitating a more complicated patching procedure. Fortunately, this patching is scripted.

DOI: <a href="https://doi.org/10.1002/jcc.26525">10.1002/jcc.26525</a>

## II. CHARMM-GUI Setup

Today's tutorial will focus on the three-site mutant in T4 lysozyme (PDB: 1L63). The "native" sequence in this case already possesses two mutations to remove the disulfide bond. Within this background we will consider combinations of three mutations: I17M, I27M, and L33M. Experimental data exist for 6 of the 8 combinations:

```
17 27 33  ddG
 I  I  L  0.0
 I  I  M  2.0
 I  M  L  3.1
 I  M  M  3.05
 M  I  L  2.2
 M  I  M  ???
 M  M  L  ???
 M  M  M  3.3
```

The salt concentration was 100 mM NaCl, and the temperature was 25$^{\circ}$C.

### II.A. Patches

Experiments were performed at a pH of 5.4. ProPKA indicates that no D or E residues are protonated at this pH, but that the histidine (HIS31) is in the protonated state. There are no disulfide bonds to include. This means no patches need to be applied on CHARMM-GUI, but HIS31 should be "mutated" to HSP.

ProPKA may be downloaded and run through Python, or run using CRIMM as a wrapper, when one needs to determine protonation states in a new protein of interest. Disulfides can be extracted from the PDB header, by inspection, or using CRIMM.
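
The protonation assignments above come down to comparing each predicted pKa with the experimental pH. The sketch below shows only that comparison using the Henderson-Hasselbalch relation. The function name `protonated_fraction` is hypothetical, and the pKa values are illustrative reference values for isolated side chains assumed for this example; they are not ProPKA predictions for T4 lysozyme, where the protein environment shifts each pKa.

```python
def protonated_fraction(pka, ph):
    """Henderson-Hasselbalch fraction of a titratable site that is protonated."""
    return 1.0 / (1.0 + 10.0 ** (ph - pka))


PH = 5.4  # experimental pH quoted in the text

# Illustrative reference pKa values (assumed; NOT ProPKA output for this protein).
reference_pka = {"ASP": 3.8, "GLU": 4.5, "HIS": 6.5}

for resname, pka in reference_pka.items():
    frac = protonated_fraction(pka, PH)
    state = "protonated" if frac > 0.5 else "deprotonated"
    print(f"{resname}: pKa {pka:.1f}, fraction protonated {frac:.2f} -> {state}")
```

A site with a pKa above the pH is mostly protonated, which is consistent with choosing HSP for the histidine while leaving the acidic residues deprotonated.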

The full sequence should have NTER and CTER caps; the peptides should each have ACE and CT3 caps.

Generate the folded structure and the three peptides on CHARMM-GUI. Note that you only need to generate the native sequence. Patches for alchemical mutations will be added later.

<a href="https://www.charmm-gui.org/?doc=input/solution">CHARMM-GUI</a>

## III. Setting Up prep Directories

The templates below are in CHARMM scripting language.

The first step is to finish equilibrating the output of CHARMM-GUI in the NVT and then NPT ensembles. (CHARMM-GUI only performs rudimentary minimization.) This is before we add any alchemy.

Also note that the developer and release versions of CHARMM differ in how water angles are autogenerated, so this initial step must remove and re-add the waters.

In [None]:
import os, sys, shutil, subprocess
import alf
import numpy as np

gotcwd=os.getcwd()

chmguidir=gotcwd+'/charmm-gui-fold'
setupdir=gotcwd+'/charmm-gui-setup'
tooldir=gotcwd+'/charmm-gui-tools'

if not os.path.exists(setupdir):
  os.mkdir(setupdir)
if os.path.exists(setupdir+'/prep'):
  shutil.rmtree(setupdir+'/prep')
os.mkdir(setupdir+'/prep')

Copy the needed files from CHARMM-GUI output. Edit `toppar/toppar_water_ions.str` so that autogenerate will not add angles to water.

In [None]:
shutil.copy(chmguidir+'/step3_pbcsetup.psf',setupdir+'/prep/')
shutil.copy(chmguidir+'/step3_pbcsetup.crd',setupdir+'/prep/')
shutil.copy(chmguidir+'/step3_pbcsetup.str',setupdir+'/prep/')
shutil.copy(chmguidir+'/crystal_image.str',setupdir+'/prep/')

shutil.copy(chmguidir+'/toppar.str',setupdir+'/prep/')
if os.path.exists(setupdir+'/prep/toppar'):
  shutil.rmtree(setupdir+'/prep/toppar')
shutil.copytree(chmguidir+'/toppar',setupdir+'/prep/toppar')

# Fix TIP3 rtf: add NOANG and NODIH to the RESI TIP3 line so autogenerate skips water
src=chmguidir+'/toppar/toppar_water_ions.str'
dst=setupdir+'/prep/toppar/toppar_water_ions.str'
with open(src,'r') as fpin, open(dst,'w') as fpout:
  for line in fpin:
    words=line.split()
    if len(words)>=2 and words[0]=='RESI' and words[1]=='TIP3':
      # Keep the first three fields, insert the flags, then any remaining fields
      fpout.write(' '.join(words[0:3]+['NOANG','NODIH']+words[3:])+'\n')
    else:
      fpout.write(line)

Copy needed scripts from template directory. `eqchmgui.inp` performs minimization, NVT simulation, and NPT simulation.

In [None]:
!cat charmm-gui-tools/eqchmgui.inp

if not os.path.exists(tooldir):
  print('Error missing tools directory')

shutil.copy(tooldir+'/nbond.str',setupdir+'/prep/')

shutil.copy(tooldir+'/eqchmgui.inp',setupdir+'/')

Run equilibration of the non-alchemical system

In [None]:
CHARMM=os.environ['CHARMMEXEC']

os.chdir(setupdir)
# Set OMP_NUM_THREADS=1 for BLaDE
subprocess.call(['mpirun','-np','1','-x','OMP_NUM_THREADS=1',CHARMM,'-i','eqchmgui.inp'])
os.chdir(gotcwd)

Make the alchemical prep/name.inp script. This script should work for most CHARMM-GUI output, and it streams files from aa_stream to do most of the alchemical setup.

If your system contains things besides water that autogeneration during alchemical patching will mess up (for example heme proteins), it is recommended to move those patches after `aa_stream/patchloop.inp`.

In [None]:
!cat charmm-gui-tools/generic.inp

shutil.copy(tooldir+'/generic.inp',setupdir+'/prep/fold.inp')

Make the alf_info.py file

In [None]:
alf_info_str="""
import numpy as np
import os
alf_info={}
alf_info['name']='fold'
alf_info['nsubs']=[2,2,2]
alf_info['nblocks']=np.sum(alf_info['nsubs'])
alf_info['ncentral']=0
alf_info['nreps']=1
alf_info['nnodes']=1
alf_info['enginepath']=os.environ['CHARMMEXEC']
alf_info['temp']=298.15
"""
fp=open(setupdir+'/prep/alf_info.py','w')
fp.write(alf_info_str)
fp.close()
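
To see how `nsubs` relates to the table of sequences in Section II, the sketch below enumerates every combination of substituents. It is an illustration only and is not used by ALF. The `residues` list is an assumption that restates the table, with the native residue first at each site.

```python
import itertools

nsubs = [2, 2, 2]  # same values as alf_info['nsubs']: two substituents per site
# One-letter codes per site, native residue first (restated from the table).
residues = [["I", "M"], ["I", "M"], ["L", "M"]]

nblocks = sum(nsubs)  # one alchemical block per substituent
# All combinations of substituent indices, last site varying fastest.
states = list(itertools.product(*[range(n) for n in nsubs]))

print(nblocks, len(states))
for state in states:
    label = " ".join(residues[site][sub] for site, sub in enumerate(state))
    print(state, label)
```

The sum of `nsubs` gives the 6 alchemical blocks, while the product gives the 8 sequences listed in the experimental table.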

Define your alchemical mutations. These CHARMM script variables control the subsequent loops. aa_stream/README gives more instructions on how to set them. Every mutation site [i] should have resid[i] and segid[i] to indicate the residue and segment of the mutation. s[i]seq1 should always be 0 for the native residue. Subsequent mutations [j] are listed by their one-letter amino acid code in s[i]seq[j].

Mutations near termini are especially tricky to patch, so several additional variables are defined to control this. These 8 termini variables should be defined for every segid [s] being mutated, here only proa. nterdel_[s] and cterdel_[s] are 0 unless the length of your sequence is changing. nterres_[s] and cterres_[s] are the resids of the first and last residue of the chain. In a pentapeptide, these are resid1-2 and resid1+2 (unless the mutation is at the very end of the chain, in which case it is just the actual terminus). ntercap_[s] and ctercap_[s] are the capping patches. Only NTER, CTER, ACE, and CT3 are supported. (These caps have other names for proline and glycine; use the generic name here, not the specific name.) nterc_[s] and cterc_[s] are one-character codes that represent the previous four caps.

In [None]:
alchemical_definitions_str="""
! List all mutation sites and mutants at each site
! j is hsp
set resid1 = 17
set s1seq1 = 0 ! i
set s1seq2 = m
set segid1 = PROA
set resid2 = 27
set s2seq1 = 0 ! i
set s2seq2 = m
set segid2 = PROA
set resid3 = 33
set s3seq1 = 0 ! l
set s3seq2 = m
set segid3 = PROA

! Set the terminal properties of any segid mutated above
set nterdel_proa = 0 ! 0 means don't do it
set nterres_proa = 1
set ntercap_proa = nter
set nterc_proa = 2 ! 2 nter, 3 cter, 4 ace, 5 ct3
set cterdel_proa = 0 ! 0 means don't do it
set cterres_proa = 162
set ctercap_proa = cter
set cterc_proa = 3  ! 2 nter, 3 cter, 4 ace, 5 ct3

! Only modify aainitl and aafinal if you're mutating things besides proteins
set aainitl = 0
set aafinal = @nsites
"""
fp=open(setupdir+'/prep/alchemical_definitions.inp','w')
fp.write(alchemical_definitions_str)
fp.close()
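
For a capped pentapeptide, the same variables change as described above: the terminal resids become resid-2 and resid+2, and the caps become ACE and CT3. The helper below is a hypothetical sketch of how such a definitions file could be generated for a single-site peptide. It assumes the peptide keeps the segid PROA and the residue numbering of the full protein, and it does not handle sites within two residues of a real terminus.

```python
def peptide_definitions(resid, mutant, segid="PROA", flank=2):
    """Build the text of alchemical_definitions.inp for a one-site capped peptide."""
    seg = segid.lower()
    lines = [
        "! List all mutation sites and mutants at each site",
        f"set resid1 = {resid}",
        "set s1seq1 = 0",  # 0 always denotes the native residue
        f"set s1seq2 = {mutant}",
        f"set segid1 = {segid}",
        "",
        "! Set the terminal properties of any segid mutated above",
        f"set nterdel_{seg} = 0",
        f"set nterres_{seg} = {resid - flank}",  # first residue of the peptide
        f"set ntercap_{seg} = ace",
        f"set nterc_{seg} = 4 ! 2 nter, 3 cter, 4 ace, 5 ct3",
        f"set cterdel_{seg} = 0",
        f"set cterres_{seg} = {resid + flank}",  # last residue of the peptide
        f"set ctercap_{seg} = ct3",
        f"set cterc_{seg} = 5 ! 2 nter, 3 cter, 4 ace, 5 ct3",
        "",
        "set aainitl = 0",
        "set aafinal = @nsites",
    ]
    return "\n".join(lines) + "\n"


# Example: the L33M peptide from the exercises.
l33m_definitions = peptide_definitions(33, "m")
print(l33m_definitions)
```

A peptide with a single mutation site would presumably also need `alf_info['nsubs']=[2]` in its own alf_info.py; check aa_stream/README before relying on this sketch.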

Copy in the alchemical stream files and patch files.

* patchloop.inp : Adds patches for the mutating residues, adds further patches to link those residues to the neighboring residues correctly. Generates initial positions for mutated atoms from internal coordinates based on positions of native atoms.
* selectloop.inp : Defines the alchemical selections used in the block module.
* deleteloop.inp : Deletes the spurious intrasite, intersubstituent angles generated by autogenerate.
* blocksetup.inp : Sets up block module with Call, Cats, ldin, ldbv, and sobo calls.

In [None]:
! cat aa_stream/patchloop.inp

In [None]:
! cat aa_stream/selectloop.inp

In [None]:
! cat aa_stream/deleteloop.inp

In [None]:
! cat aa_stream/blocksetup.inp

In [None]:
if os.path.exists(setupdir+'/prep/aa_stream'):
  shutil.rmtree(setupdir+'/prep/aa_stream')
shutil.copytree('aa_stream',setupdir+'/prep/aa_stream')

Do two cycles of flattening to make sure setup worked

In [None]:
os.chdir(setupdir)
sys.path.insert(0,'') # so alf can find prep after os.chdir
alf.initialize(engine='bladelib')
alf.runflat(1,2,13000,39000,engine='bladelib')
os.chdir(gotcwd)

Copy the prep directory to `t4l_fold` to use it. (Or copy the bash files from there to `charmm-gui-setup` to continue from the two cycles we already ran.)

In [None]:
os.chdir(gotcwd)
# if os.path.exists(gotcwd+'/t4l_fold/prep'):
#   shutil.rmtree(gotcwd+'/t4l_fold/prep')
# shutil.copytree(setupdir+'/prep',gotcwd+'/t4l_fold/prep')
os.system('rm -r t4l_fold/prep')
os.system('cp -r charmm-gui-setup/prep t4l_fold/prep')

## IV. Running ALF

In [None]:
os.environ['SLURMOPTSMD']='--time=240 --ntasks=1 --tasks-per-node=1 --cpus-per-task=1 -p gpu -A rhayes1_lab_gpu --gres=gpu:1 --export=ALL'
os.environ['SLURMOPTSPP']='--time=240 --ntasks=1 --tasks-per-node=1 --cpus-per-task=1 -p gpu -A rhayes1_lab_gpu --gres=gpu:1 --export=ALL'

In [None]:
os.chdir(gotcwd+'/t4l_fold')
!./subsetAll.sh
os.chdir(gotcwd)

## V. Analyzing Results

Apply the independent peptide approximation to get the free energy of the unfolded ensemble.
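
In symbols, with $s_i$ the substituent at site $i$, $\Delta G_{\mathrm{U},i}$ the result from the peptide containing site $i$, and $\sigma_i$ its uncertainty (notation introduced here for illustration), the approximation and its error propagation read:

$$
\Delta G_{\mathrm{U}}(s_1, s_2, s_3) \approx \sum_{i=1}^{3} \Delta G_{\mathrm{U},i}(s_i),
\qquad
\sigma_{\mathrm{U}} = \sqrt{\sum_{i=1}^{3} \sigma_i^2}
$$

The next cell evaluates this sum for every combination of substituents.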

In [None]:
!echo I17M
!cat prerun/t4l_i17m/Result.txt
!echo I27M
!cat prerun/t4l_i27m/Result.txt
!echo L33M
!cat prerun/t4l_l33m/Result.txt

import math

fpin=[]
fpin.append(open("prerun/t4l_i17m/Result.txt","r"))
fpin.append(open("prerun/t4l_i27m/Result.txt","r"))
fpin.append(open("prerun/t4l_l33m/Result.txt","r"))
fpout=open("prerun/ResultU.txt","w")

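i1=[]       # substituent index at each site, one list per site
V=[]        # free energy of each substituent
E=[]        # uncertainty of each free energy
indices=[]  # odometer: current substituent at each site
for i in range(0,len(fpin)):
  i1.append([])
  V.append([])
  E.append([])
  indices.append(0)
  lines=fpin[i].readlines()
  for j in range(0,len(lines)):
    line=lines[j].split()
    if len(line)<4:
      continue # skip blank or incomplete lines
    # Fields 0, 1, and 3 hold the index, the value, and the uncertainty
    i1[i].append(int(line[0]))
    V[i].append(float(line[1]))
    E[i].append(float(line[3]))

# Loop over all combinations of substituents, last site varying fastest
while indices[0]<len(i1[0]):
  Vi=0
  Ei=0
  for i in range(0,len(fpin)):
    fpout.write("%2d " % (i1[i][indices[i]],))
    Vi+=V[i][indices[i]] # free energies add across independent peptides
    Ei=math.sqrt(Ei**2 + E[i][indices[i]]**2) # uncertainties add in quadrature
  fpout.write("%8.3f +/- %5.3f\n" % (Vi,Ei))
  # Advance the odometer, carrying into the previous site when a site wraps
  for i in range(0,len(fpin)):
    indices[len(fpin)-1-i]+=1
    if indices[len(fpin)-1-i]==len(i1[len(fpin)-1-i]) and (len(fpin)-1-i)!=0:
      indices[len(fpin)-1-i]=0
    else:
      break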

for i in range(0,len(fpin)):
  fpin[i].close()
fpout.close()

!echo Unfolded ensemble
!cat prerun/ResultU.txt

In [None]:
!echo folded
!cat prerun/t4l_fold/Result.txt

import math

fp1=open("prerun/ResultU.txt","r")
fp2=open("prerun/t4l_fold/Result.txt","r")
fp3=open("prerun/Result.txt","w")

lines1=fp1.readlines()
lines2=fp2.readlines()

nsites=len(lines1[0].split())-3
for i in range(0,len(lines1)):
  line1=lines1[i].split()
  line2=lines2[i].split()

  i1=[]
  for j in range(0,nsites):
    i1.append(int(line1[j]))
  V=float(line2[nsites])-float(line1[nsites])
  E=math.sqrt(float(line2[nsites+2])**2 + float(line1[nsites+2])**2)

  for j in range(0,nsites):
    fp3.write("%2d " % (i1[j],))
  fp3.write("%8.3f +/- %5.3f\n" % (V,E))

fp1.close()
fp2.close()
fp3.close()

!echo ddG
!cat prerun/Result.txt

At this point, one could consider computing Pearson's correlation and the root mean squared error against the experimental values:

```
17 27 33  ddG
 I  I  L  0.0
 I  I  M  2.0
 I  M  L  3.1
 I  M  M  3.05
 M  I  L  2.2
 M  I  M  ???
 M  M  L  ???
 M  M  M  3.3
```
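
One way to make that comparison is sketched below. The helper names are hypothetical, and the sketch assumes that the site indices written to `prerun/Result.txt` are 0 for the native residue and 1 for methionine; if the files use a different numbering, the keys of `experiment` need to be shifted accordingly. The two states without experimental data are simply left out.

```python
import math

# Experimental ddG values from the table above, keyed by substituent index at
# sites 17, 27, 33. Assumption: index 0 is the native residue and 1 is Met.
experiment = {
    (0, 0, 0): 0.0,
    (0, 0, 1): 2.0,
    (0, 1, 0): 3.1,
    (0, 1, 1): 3.05,
    (1, 0, 0): 2.2,
    (1, 1, 1): 3.3,
}


def read_results(path, nsites=3):
    """Parse lines of the form 'i j k  value +/- error' into {(i, j, k): value}."""
    results = {}
    with open(path) as fp:
        for line in fp:
            words = line.split()
            if len(words) < nsites + 3:
                continue  # skip blank or incomplete lines
            key = tuple(int(w) for w in words[:nsites])
            results[key] = float(words[nsites])
    return results


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def rmse(x, y):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))


def compare(computed, measured):
    """Return (Pearson r, RMSE) over the states present in both dictionaries."""
    keys = sorted(k for k in measured if k in computed)
    x = [computed[k] for k in keys]
    y = [measured[k] for k in keys]
    return pearson_r(x, y), rmse(x, y)
```

After the analysis cells have run, `compare(read_results("prerun/Result.txt"), experiment)` would return the two statistics for the six measured sequences.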

## VI. Exercises

* Set up L33M peptide on CHARMM-GUI
* Run L33M peptide with ALF