# Workshop #4: Folding

Any folding algorithm requires…


- …a search strategy, an algorithm to generate many candidate structures (or decoys) and…


- …a scoring function to discriminate near-native structures from all the others.


In this workshop you will write your own Monte Carlo protein folding algorithm from scratch, and we will explore a couple of the tricks used by Simons et al. (1997, 1999) to speed up the folding search.


### Suggested Readings
1. K. T. Simons et al., “Assembly of Protein Structures from Fragments,” *J. Mol. Biol.* 268, 209-225 (1997).
2. K. T. Simons et al., “Improved recognition of protein structures,” *Proteins* 34, 82-95 (1999).
3. Chapter 4 (Monte Carlo methods) of M. P. Allen & D. J. Tildesley, *Computer Simulation of Liquids*, Oxford University Press, 1989.

## Building the Pose

In this workshop, you will fold a 10-residue protein by building a simple de novo folding algorithm. Start by initializing PyRosetta as usual.

In [1]:
from pyrosetta import *
pyrosetta.init()

core.init: Checking for fconfig files in pwd and ./rosetta/flags
core.init: Rosetta version: PyRosetta4.Release.python36.mac r208 2019.04+release.fd666910a5e fd666910a5edac957383b32b3b4c9d10020f34c1 http://www.pyrosetta.org 2019-01-22T15:55:37
core.init: command: PyRosetta -ex1 -ex2aro -database /Users/kathyle/Computational Protein Prediction and Design/PyRosetta4.Release.python36.mac.release-208/pyrosetta/database
core.init: 'RNG device' seed mode, using '/dev/urandom', seed=459134809 seed_offset=0 real_seed=459134809
core.init.random: RandomGenerator:init: Normal mode, seed=459134809 RG_type=mt19937


Create a simple poly-alanine `pose` with 10 residues for testing your folding algorithm. Store the pose in a variable called "polyA."

In [4]:
### BEGIN SOLUTION
polyA = pyrosetta.pose_from_sequence('A' * 10)
### END SOLUTION

polyA.pdb_info().name("polyA")

__Question:__
Check the backbone dihedrals of a few residues (except the first and last) using the `.phi()` and `.psi()` methods in `Pose`. What are the values of $\phi$ and $\psi$ dihedrals? You should see ideal bond lengths and angles, but the dihedrals may not be as realistic.

In [14]:
### BEGIN SOLUTION
print("phi: %i" %polyA.phi(9))
print("psi: %i" %polyA.psi(9))
### END SOLUTION

phi: 180
psi: 180


We want to visualize folding as it happens. Before starting the folding protocol, instantiate a PyMOL mover and use a UNIQUE port number between 10,000 and 65,535 (the highest valid port number). We will retain history with the `.keep_history()` method so that the entire folding process can be viewed afterwards. Make sure PyMOL says `PyMOL <---> PyRosetta link started!` on its command line.

In [19]:
pmm = PyMOLMover()
pmm.keep_history(True)


Use the PyMOL mover to view the `polyA` `Pose`. You should see a long thread-like structure in PyMOL.

In [20]:
pmm.apply(polyA)

## Building A Basic *de Novo* Folding Algorithm

Now, write a program that implements a Monte Carlo algorithm to optimize the protein conformation. You can do this here in the notebook, or you may use a code editor to write a `.py` file and execute it in a Python or IPython shell.

Our main program will include 100 iterations of making a random trial move, scoring the protein, and accepting/rejecting the move. Therefore, we can break this algorithm down into three smaller subroutines: **random, score, and decision.**

For the **random** trial move, write a subroutine that chooses one residue at random using `random.randint()` (note that both of its endpoints are inclusive) and then perturbs either the φ or the ψ angle of that residue by a random amount drawn from a Gaussian distribution. Use `random.gauss()` from Python's standard `random` module with the current angle as the mean and a standard deviation of 25°. After changing the torsion angle, use `pmm.apply(polyA)` to update the structure in PyMOL.

In [25]:
import math
import random
from pyrosetta.teaching import *

### BEGIN SOLUTION
def randTrial(your_pose):
    # random.randint() includes both endpoints, so this picks residue 1..N.
    randNum = random.randint(1, your_pose.total_residue())
    # Perturb either phi or psi (not both) with a Gaussian step, SD 25 degrees.
    if random.random() < 0.5:
        newPhi = random.gauss(your_pose.phi(randNum), 25)
        your_pose.set_phi(randNum, newPhi)
    else:
        newPsi = random.gauss(your_pose.psi(randNum), 25)
        your_pose.set_psi(randNum, newPsi)
    pmm.apply(your_pose)
    return your_pose
### END SOLUTION
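
The same trial move can be tried without PyRosetta or PyMOL. The following is a minimal sketch, assuming the backbone is stored as a plain list of (φ, ψ) tuples. The names `wrap_angle` and `perturb_torsions` are hypothetical helpers, not part of PyRosetta, and wrapping the angle back into [-180, 180) is a convenience added here for readability.

```python
import random

def wrap_angle(angle):
    """Map any angle in degrees onto the interval [-180, 180)."""
    return (angle + 180.0) % 360.0 - 180.0

def perturb_torsions(torsions, sd=25.0, rng=random):
    """Return a copy of `torsions` with one phi or psi angle perturbed.

    `torsions` is a list of (phi, psi) tuples, one per residue. It is a
    stand-in for a Pose and is used only for illustration.
    """
    new_torsions = list(torsions)
    index = rng.randrange(len(new_torsions))  # 0-based, upper bound excluded
    phi, psi = new_torsions[index]
    if rng.random() < 0.5:
        phi = wrap_angle(rng.gauss(phi, sd))  # Gaussian step centered on phi
    else:
        psi = wrap_angle(rng.gauss(psi, sd))  # Gaussian step centered on psi
    new_torsions[index] = (phi, psi)
    return new_torsions

rng = random.Random(0)             # seeded only to make the demo repeatable
start = [(180.0, 180.0)] * 10      # fully extended 10-residue chain
moved = perturb_torsions(start, rng=rng)
changed = [i for i, (old, new) in enumerate(zip(start, moved)) if old != new]
print(changed)                     # index of the single perturbed residue
```

Note the difference in indexing: this sketch uses 0-based list indices with `randrange()`, whereas `Pose` residues are numbered from 1, which is why the notebook version uses `randint(1, N)`.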

For the **scoring** step, we need to create a scoring function and make a subroutine that returns the numerical energy score of the pose.

In [29]:
sfxn = get_fa_scorefxn()

def score(your_pose):
    ### BEGIN SOLUTION
    return sfxn(your_pose)
    ### END SOLUTION

core.scoring.ScoreFunctionFactory: SCOREFUNCTION: ref2015
core.scoring.etable: Starting energy table calculation
core.scoring.etable: smooth_etable: changing atr/rep split to bottom of energy well
core.scoring.etable: smooth_etable: spline smoothing lj etables (maxdis = 6)
core.scoring.etable: smooth_etable: spline smoothing solvation etables (max_dis = 6)
core.scoring.etable: Finished calculating energy tables.
basic.io.database: Database file opened: scoring/score_functions/hbonds/ref2015_params/HBPoly1D.csv
basic.io.database: Database file opened: scoring/score_functions/hbonds/ref2015_params/HBFadeIntervals.csv
basic.io.database: Database file opened: scoring/score_functions/hbonds/ref2015_params/HBEval.csv
basic.io.database: Database file opened: scoring/score_functions/hbonds/ref2015_params/DonStrength.csv
basic.io.database: Database file opened: scoring/score_functions/hbonds/ref2015

For the **decision** step, we need to make a subroutine that either accepts or rejects the new conformation based on the Metropolis criterion. Let $\Delta E$ be the energy of the new conformation minus the energy of the old one. When $\Delta E \geq 0$, the move is accepted with probability $P = \exp( -\Delta E / kT )$. When $\Delta E < 0$, the move is always accepted ($P = 1$). Use $kT = 1$ Rosetta Energy Unit (REU).

In [30]:
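def decision(before_pose, after_pose):
    ### BEGIN SOLUTION
    # Energy change caused by the trial move; kT = 1 REU.
    E = sfxn(after_pose) - sfxn(before_pose)
    if E < 0:
        return after_pose
    # Uphill move: accept with probability exp(-E/kT).
    if random.uniform(0, 1) < math.exp(-E/1):
        return after_pose
    return before_pose
    ### END SOLUTION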
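
To get a feel for the Metropolis criterion before wiring it into the folding loop, it can help to look at the acceptance probability on its own. This is a minimal sketch in plain Python. The function names `acceptance_probability` and `accept_move` are chosen here for illustration and do not come from PyRosetta.

```python
import math
import random

def acceptance_probability(delta_e, kT=1.0):
    """Metropolis probability of accepting a move with energy change delta_e."""
    if delta_e < 0:
        return 1.0                      # downhill moves are always accepted
    return math.exp(-delta_e / kT)      # uphill moves are accepted sometimes

def accept_move(delta_e, kT=1.0, rng=random):
    """Return True if the trial move should be accepted."""
    return rng.random() < acceptance_probability(delta_e, kT)

# Acceptance probability for a few energy changes at kT = 1 REU.
for delta_e in (-2.0, 0.0, 1.0, 5.0):
    print(delta_e, round(acceptance_probability(delta_e), 4))
```

With $kT = 1$, a move that raises the energy by 1 REU is accepted a bit more than a third of the time, while a 5 REU increase is accepted less than 1% of the time. Raising $kT$ makes uphill moves more likely to be accepted.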

Now we can put these three subroutines together in our main program! Write a loop in the main program so that it performs 100 iterations of: making a random trial move, scoring the protein, and accepting/rejecting the move. 

After each iteration of the search, output the current pose energy and the lowest energy ever observed. **The final output of this program should be the lowest energy conformation that is achieved at *any* point during the simulation.** Be sure to use `lowest_pose.assign(pose)` rather than `lowest_pose = pose`, since the latter only makes `lowest_pose` a second reference to the same pose object, so it would keep changing as `pose` changes.
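
The difference between copying a reference and copying the data can be seen with a toy object. The sketch below uses a hypothetical `ToyPose` class, which only mimics the idea of an `assign()` method and is not the real `Pose` class.

```python
import copy

class ToyPose:
    """Hypothetical stand-in for a Pose: just a list of torsion angles."""

    def __init__(self, angles=None):
        self.angles = list(angles or [])

    def assign(self, other):
        """Copy the other object's data into this object."""
        self.angles = copy.deepcopy(other.angles)

pose = ToyPose([180.0, 180.0])

alias = pose               # a second name for the same object
snapshot = ToyPose()
snapshot.assign(pose)      # an independent copy of the data

pose.angles[0] = -60.0     # keep "folding" the working pose

print(alias.angles[0])     # follows the change made to pose
print(snapshot.angles[0])  # still holds the saved value
```

Only `snapshot` preserves the earlier conformation, which is the behavior needed for tracking the lowest-energy pose.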

In [None]:
# Create an empty pose.
lowest_pose = Pose()

### BEGIN SOLUTION

### END SOLUTION

In [None]:
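def basic_folding(your_pose):
    """Your basic folding algorithm that completes 100 Monte-Carlo iterations on a given pose"""
    ### BEGIN SOLUTION
    lowest_pose.assign(your_pose)  # lowest-energy pose seen so far
    before_pose = Pose()
    for i in range(100):
        before_pose.assign(your_pose)  # copy, so a rejected move can be undone
        randTrial(your_pose)
        # Revert only when the Metropolis test rejects the trial move.
        if decision(before_pose, your_pose) is before_pose:
            your_pose.assign(before_pose)
        if score(your_pose) < score(lowest_pose):
            lowest_pose.assign(your_pose)
        print(i, score(your_pose), score(lowest_pose))
    return lowest_pose
    ### END SOLUTION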

Finally, output the last pose and the lowest-scoring pose observed and view them in PyMOL. Plot the energy and lowest-energy observed vs. cycle number. What are the energies of the initial, last, and lowest-scoring pose? Is your program working? Has it converged to a good solution?
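
For the plot, the "lowest energy observed" curve is simply the running minimum of the per-cycle energies. A minimal sketch, assuming the energies have been collected in a list during the run (the numbers below are made up for illustration and are not output from a real simulation):

```python
from itertools import accumulate

def lowest_so_far(energies):
    """Return the running minimum of a sequence of energies."""
    return list(accumulate(energies, min))

# Made-up energies for illustration only, not output from a real run.
energies = [12.0, 9.5, 10.1, 7.2, 8.8, 7.9]
lowest = lowest_so_far(energies)

# One row per cycle: cycle number, current energy, lowest energy so far.
for cycle, (current, low) in enumerate(zip(energies, lowest), start=1):
    print(cycle, current, low)
```

The two lists can then be passed to any plotting library, for example as two lines on the same axes, with the cycle number on the x-axis.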


Using the program you wrote for Workshop #2, force the $A_{10}$ sequence into an ideal α-helix.

**Questions:** Does this helical structure have a lower score than that produced by your folding algorithm above? What does this mean about your sampling or discrimination?

Since your program is a stochastic search algorithm, it may not produce an ideal structure consistently, so try running the simulation multiple times or with a different number of cycles (if necessary). Using a kT of 1, your program may need to make up to 500,000 iterations.

## Low-Resolution (Centroid) Scoring


Following the treatment of Simons *et al.* (1999), Rosetta can score a protein conformation using a low-resolution representation. This will make the energy calculation faster.

Load chain A of Ras, the protein used in Workshop #3, and calculate the full-atom energy of the pose.

```
pose = pyrosetta.pose_from_pdb("6Q21_A.pdb")
sfxn = pyrosetta.get_score_function()
sfxn(pose)
```

**Question:** Print residue 5 using the command below. Note the number of atoms in residue 5 and their coordinates.

```
print(pose.residue(5))
```

Now, convert the `pose` to the centroid form by using a `SwitchResidueTypeSetMover` object and the apply method:

```
switch = SwitchResidueTypeSetMover("centroid")
switch.apply(pose)
print(pose.residue(5))
```

**Question:** How many atoms are now in residue 5? How is this different than before switching it into centroid mode?

Score the new, centroid-based pose by creating and using the standard centroid score function "score3".

```
cen_sfxn = pyrosetta.create_score_function("score3")
cen_sfxn(pose)
```

**Question:** What is the new total score? What scoring terms are included in "score3" (`print` the `cen_sfxn`)? Do these match the terms described by Simons *et al.* (1999)?

Convert the `pose` back to all-atom form by using another switch object, `SwitchResidueTypeSetMover("fa_standard")`.

```
fa_switch = SwitchResidueTypeSetMover("fa_standard")
fa_switch.apply(pose)
print(pose.residue(5))
```

**Question:** Confirm that you have all the atoms back. Are the atoms in the same coordinate position as before?

Go back and adjust your folding algorithm to use centroid mode. Create a `ScoreFunction` that uses only van der Waals (`fa_atr` and `fa_rep`) and `hbond_sr_bb` energy score terms. 

**Question:** How much faster does your program run?

### Note about `Movers`

Not counting the `PyMOLMover`, which is a special case, `SwitchResidueTypeSetMover` is the first example we have seen of a `Mover` class in PyRosetta. Every `Mover` object in PyRosetta has been designed to apply specific and complex changes (or “moves”) to a `pose`. Every `Mover` must be “constructed” and have any options set before being applied to a `pose` with the `apply()` method. `SwitchResidueTypeSetMover` has a relatively simple construction with only the single option `"centroid"`. (Some `Movers`, as we shall see, require no options and are programmed to operate with default values).

## Protein Fragments


In your terminal, look at the provided `3mer.frags` fragment file (look in the `class` directory on Polander). These fragments were generated by the Robetta server (http://robetta.bakerlab.org/fragmentsubmit.jsp) for a given sequence. You should see sets of three lines, each set describing one fragment.

**Questions:** For the first fragment, which PDB file does it come from? Is this fragment helical, sheet, in a loop, or a combination? What are the φ, ψ, and ω angles of the middle residue of the first fragment window?

Create a new subroutine in your folding code for an alternate random move based upon a “fragment insertion”. A fragment insertion is the replacement of the torsion angles for a set of consecutive residues with new torsion angles pulled at random from a fragment library file. Prior to calling the subroutine, load the set of fragments from the fragment file:

```
from pyrosetta.rosetta.core.fragment import *
fragset = ConstantLengthFragSet(3)
fragset.read_fragment_file("3mer.frags")
```

Next, we will construct another `Mover` object, this time a `ClassicFragmentMover`, using the above fragment set and a `MoveMap` object as options. A `MoveMap` specifies which degrees of freedom are allowed to change in the `pose` when the `Mover` is applied (in this case, all backbone torsion angles):

```
from pyrosetta.rosetta.protocols.simple_moves import ClassicFragmentMover
movemap = MoveMap()
movemap.set_bb(True)
mover_3mer = ClassicFragmentMover(fragset, movemap)
```

Note that when a `MoveMap` is constructed, all degrees of freedom are set to False initially. If you still have a `PyMOLMover` instantiated, you can quickly visualize which degrees of freedom will be allowed by sending your move map to PyMOL with:

```
test_pose = pyrosetta.pose_from_sequence("RFPMMSTFKVLLCGAVLSRIDAG")
pmm.apply(test_pose)
pmm.send_movemap(test_pose, movemap)
```

Each time this mover is applied, it will select a random 3-mer window and insert only the backbone torsion angles from a random matching fragment in the fragment set. Here is an example using the above `test_pose`:

```
mover_3mer.apply(test_pose)
pmm.apply(test_pose)
```
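
The idea behind a fragment insertion can also be sketched without PyRosetta. The example below is a toy illustration of the move described above, not a description of how `ClassicFragmentMover` is implemented. The data layout (a list of (φ, ψ, ω) tuples and a dictionary of candidate fragments per window), the function name `insert_fragment`, and the toy library are all assumptions made for this sketch. The helix and strand angles are typical textbook values, not values read from `3mer.frags`.

```python
import random

def insert_fragment(torsions, fragments, rng=random):
    """Return a copy of `torsions` with one fragment's angles inserted.

    `torsions` is a list of (phi, psi, omega) tuples, one per residue.
    `fragments` maps a 0-based window start to a list of candidate
    fragments, each a list of (phi, psi, omega) tuples of equal length.
    Both layouts are simplifications chosen for this sketch.
    """
    start = rng.choice(sorted(fragments))      # pick a random window
    fragment = rng.choice(fragments[start])    # pick a random fragment for it
    new_torsions = list(torsions)
    new_torsions[start:start + len(fragment)] = fragment
    return new_torsions, start

helix = (-57.0, -47.0, 180.0)      # roughly alpha-helical
strand = (-120.0, 130.0, 180.0)    # roughly beta-strand
extended = (180.0, 180.0, 180.0)

# A toy library: two candidate 3-mers for every window of a 10-residue chain.
library = {start: [[helix] * 3, [strand] * 3] for start in range(8)}

chain = [extended] * 10
new_chain, start = insert_fragment(chain, library, rng=random.Random(1))
print(start, new_chain[start:start + 3])
```

Unlike the single-angle Gaussian move, one insertion replaces the torsion angles of three consecutive residues at once with values that already occur together in known structures.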

**Question:** When you change your random move in your poly-alanine folding algorithm to a fragment insertion, how much faster is your protocol? Does it converge to a protein-like conformation more quickly?

### Programming Exercises

- Fold a 10-mer poly-alanine using 100 independent trajectories, using any variant of the folding algorithm that you like. (A trajectory is a path through the conformation space traveled during the calculation. The end result of each independent trajectory is called a “decoy”. Given enough sampling, the lowest energy decoy may correspond to the global minimum.) Create a Ramachandran plot using the lowest-scoring conformations (decoys) from all 100 independent trajectories. Repeat this for a 10-mer poly-glycine. How do the plots differ? Compare with the plots in Richardson’s article.


- Test your folding program’s ability to predict a real fold from scratch. Choose a small protein to keep the computation time down, such as Hox-B1 homeobox protein (1B72) or RecA (2REB). How many iterations and how many independent trajectories do you need to run to find a good structure?


- Modify your folding program to include a simulated annealing temperature schedule, decaying exponentially from kT = 100 to kT = 0.1 over the course of the search. Again, fold a test protein. Does this approach work better?


- Modify your folding program to remove the Metropolis criterion and instead accept trial moves only when the energy decreases. Plot energy vs. iteration and examine the final output structures from multiple runs. How is the convergence and performance affected? Why?
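
For the simulated annealing exercise, one way to build an exponentially decaying schedule is to multiply kT by a constant factor at every step, chosen so that the last step lands on the final temperature. This is a minimal sketch of that idea. The function name `annealing_schedule` and its parameters are hypothetical.

```python
def annealing_schedule(n_steps, kT_start=100.0, kT_end=0.1):
    """Return n_steps temperatures decaying exponentially from kT_start to kT_end."""
    if n_steps < 2:
        return [kT_start]
    # Constant factor applied at every step so the final value is kT_end.
    ratio = (kT_end / kT_start) ** (1.0 / (n_steps - 1))
    return [kT_start * ratio ** step for step in range(n_steps)]

# A very short schedule, just to show the shape of the decay.
schedule = annealing_schedule(4)
print([round(kT, 3) for kT in schedule])
```

In the folding loop, the value for the current iteration would replace the fixed kT = 1 in the Metropolis test, so that early iterations accept most uphill moves and late iterations accept very few.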


### Thought Questions

- **[Introductory]** What are the limitations of these types of folding algorithms?


- **[Advanced]** How might you design an intermediate-resolution representation of side chains that has more detail than the centroid approach yet is faster than the full-atom approach? Which types of residues would most benefit from this type of representation?