# Protein and Genetic Engineering

### P1 - Working with poses

#### Introduction

When working with protein structures, it is essential to have straightforward access to the structural data, including the protein's composition, topology, and coordinates. Together with visualization, which helps us understand the spatial organization of the structure, programs can carry out many valuable calculations that further our understanding of the macromolecular system and its functions.
[PyRosetta](http://www.pyrosetta.org/) is a library for modeling and designing macromolecular structures. It provides a Python interface to a more extensive program called [Rosetta](https://www.rosettacommons.org/). Here, we review some basic concepts to learn how to access and manipulate a protein's structural data using this library.

#### Importing and initializing Rosetta

First, we start by importing the library's content in our Jupyter notebook:

In [None]:
from pyrosetta import *
init()

#### Loading a protein structure as a ```Pose``` object

PyRosetta reads PDB files into a ```Pose()``` object. This is a special class with several methods for working with the protein's structure. We start by initializing this class, using as input a PDB file contained in the input folder:

In [None]:
pose = pose_from_pdb('input/5TJ3.pdb')

Our ```pose``` variable now references an initialized instance of PyRosetta's ```Pose()``` class. We can now access several attributes and methods of this class. Note that a large amount of output text is written when the PDB file is loaded into a Pose object. This output reflects details of the process of parsing the information contained in the PDB file. An important piece of information is the list of atoms missing from the PDB file, which is inferred directly from each residue's atomic composition. Since an incomplete residue makes no sense inside the protein, the library automatically builds coordinates for the missing atoms based on known (and ideal) geometries for each residue type.

We can find details about the ```Pose``` object by calling the ```help()``` function on it:

In [None]:
help(pose)

### Accessing the ```Pose``` object sequence information

We can access the protein sequence from the ```pose``` object:

In [None]:
print(pose.sequence())

At the end of the sequence there are four 'Z' residues, which correspond to zinc ions in the protein. We can create a protein structure without any other molecules or ions by calling the ```cleanATOM()``` function, found in PyRosetta's toolbox, on the PDB file:

In [None]:
from pyrosetta.toolbox import cleanATOM

In [None]:
cleanATOM('input/5TJ3.pdb')
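As a rough picture of what this step does, the sketch below filters a handful of PDB-style lines and keeps only those that start with `ATOM` or `TER`. It assumes that this is the filtering rule ```cleanATOM()``` applies; the record lines and the `keep_protein_records()` helper are hypothetical and only illustrate the idea, they are not part of PyRosetta.

```python
# Hypothetical helper illustrating the assumed filtering rule of cleanATOM():
# keep only records whose line starts with "ATOM" or "TER".
def keep_protein_records(lines):
    return [line for line in lines if line.startswith(("ATOM", "TER"))]

# Shortened, made-up PDB-style records (real records have more columns)
pdb_lines = [
    "HEADER    EXAMPLE",
    "ATOM      1  N   MET A   1",
    "ATOM      2  CA  MET A   1",
    "TER       3      MET A   1",
    "HETATM    4  O   HOH A 301",
    "HETATM    5 ZN    ZN A 501",
]

clean_lines = keep_protein_records(pdb_lines)
for line in clean_lines:
    print(line)
```

Under this assumption, anything stored in `HETATM` records (ions, ligands, waters) would be dropped by the filter.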

If we now check the content of our "input" folder, we can see that a new file has been written. Its name has the string ```.clean.``` inserted, and it contains the protein structure without any non-protein atoms. We can load this structure into a new ```Pose()``` instance:

In [None]:
pose_clean = pose_from_pdb('input/5TJ3.clean.pdb')

Let's print the sequence of this new PDB file:

In [None]:
print(pose_clean.sequence())

Notice how the two printed sequences differ from each other. The "clean" sequence no longer contains the Zn atoms. We can also devise a simple algorithm to find the differences between these two sequences:

In [None]:
### Get missing residues from poses ###

# Build an alignment for the sequences of each pose
sa = rosetta.core.sequence.align_poses_naive(pose, pose_clean)

# Get the aligned sequences for each pose
aligned_pose = sa.sequence(1).sequence()
aligned_pose_clean = sa.sequence(2).sequence()

# Create auxiliary counters for the position in each sequence
c1 = 0
c2 = 0

# Iterate over each pair of aligned positions
for res1, res2 in zip(aligned_pose, aligned_pose_clean):

    # Count the residues seen so far in each sequence (1-based positions)
    if res1 != '-':
        c1 += 1
    if res2 != '-':
        c2 += 1

    # Report a gap in the "pose" sequence
    if res1 == '-':
        print('pose is missing a '+res2+' in position '+str(c2))

    # Report a gap in the "pose_clean" sequence
    if res2 == '-':
        print('pose_clean is missing a '+res1+' in position '+str(c1))

We can see that "pose_clean" is missing a threonine (T) residue at position 56 (and we also see that the alignment only includes protein residues).
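The same gap-scanning idea can be tried without PyRosetta on two already-aligned strings. This is a minimal sketch: the `report_gaps()` helper and the toy sequences are hypothetical, `-` is assumed to mark an alignment gap, and positions are reported 1-based in the sequence that does contain the residue.

```python
def report_gaps(aligned_a, aligned_b, gap="-"):
    """Return (sequence_with_gap, residue, position) tuples for two aligned strings.

    The position is the 1-based index of the residue in the sequence that
    contains it (the sequence without the gap at that column).
    """
    gaps = []
    pos_a = 0  # residues seen so far in sequence a
    pos_b = 0  # residues seen so far in sequence b
    for res_a, res_b in zip(aligned_a, aligned_b):
        # Advance each counter only when the column holds a real residue
        if res_a != gap:
            pos_a += 1
        if res_b != gap:
            pos_b += 1
        if res_a == gap and res_b != gap:
            gaps.append(("a", res_b, pos_b))
        elif res_b == gap and res_a != gap:
            gaps.append(("b", res_a, pos_a))
    return gaps

# Toy alignment: sequence b lacks the T at position 3 and the trailing ZZ
aligned_a = "MKTAYZZ"
aligned_b = "MK-AY--"
print(report_gaps(aligned_a, aligned_b))
```

Updating both counters before reporting keeps the reported positions 1-based for either sequence.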

### Accessing the ```Pose``` object chain and residue information

We can check why our "pose_clean" is missing a protein residue by looking in more detail at our annotated sequences. First let's get the number of residues in each pose:

In [None]:
print(pose.total_residue())
print(pose_clean.total_residue())

We see we are missing five residues in our "pose_clean" pose. Can you name them?

Now we print the annotated sequence, which gives further details about the residue types in each sequence:

In [None]:
print(pose.annotated_sequence())

In [None]:
print(pose_clean.annotated_sequence())

Can you now see why we are missing the threonine residue at position 56?

We can access individual residues by indexing:

In [None]:
# Get first residue
first_residue = pose.residue(1)

# Print first residue name
print(first_residue.name())

In [None]:
# Get residue 56
missing_residue = pose.residue(56)

# Print residue 56 name
print(missing_residue.name())

In the next blank cell, call the ```help()``` function on our ```missing_residue``` object:

Besides accessing residues by index, we can query the information stored in the original PDB file, such as the chain and the PDB residue number of a given pose residue:

In [None]:
print(pose.pdb_info().chain(1))
print(pose.pdb_info().number(1))

We can also access residues through their PDB numbering. First we use the ```pdb2pose()``` method to get the pose index of a particular residue:

In [None]:
# PDB numbering to Pose numbering
residue_index = pose.pdb_info().pdb2pose('A', 79)

print(residue_index)
print(pose.residue(residue_index))
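The two numbering schemes differ because pose indices run consecutively from 1 to the number of residues, whereas the PDB file keeps its own chain letters and residue numbers. A minimal sketch of such a lookup, using a hypothetical list of (chain, PDB number) pairs rather than a real structure:

```python
# Hypothetical residues as (chain, PDB residue number), in file order.
# Chain A starts at residue 77; chain B restarts the numbering.
pdb_residues = [("A", 77), ("A", 78), ("A", 79), ("B", 1), ("B", 2)]

# Pose-style numbering: consecutive 1-based indices over all residues
pdb_to_pose = {key: index for index, key in enumerate(pdb_residues, start=1)}
pose_to_pdb = {index: key for key, index in pdb_to_pose.items()}

print(pdb_to_pose[("A", 79)])  # pose-style index of chain A, residue 79
print(pose_to_pdb[4])          # chain and PDB number of the 4th residue
```

The dictionaries here only stand in for the idea; with a real pose the lookup is done by the ```pdb_info()``` methods shown above.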

There is other information we can get from a residue object. Different residues have different properties (e.g., charged, aromatic, etc.). The residue object has methods that let us query whether the residue has a particular property:

In [None]:
res_28 = pose.residue(28)
print(res_28.name())
print(res_28.is_charged())

You can use autocomplete to see other properties that can be queried with this type of method. In the next blank cell write:
    
res_28.is_

Then press the tab key to see the available suggestions.

Try to load a zinc residue and query whether it is a metal (```.is_metal```?):

The residue object also contains information about its atoms. We can get the index of any atom by using the atom's name:

In [None]:
print(res_28.atom_index('CA'))

In [None]:
res_28_CA = res_28.atom(2)
print(res_28_CA)

### Accessing the ```Residue``` object geometrical information

The most important angles that describe the protein backbone geometry are the phi and psi angles. To access these angles for a specific residue, we call the ```phi()``` or ```psi()``` methods of the ```Pose()``` object with the corresponding residue index:

In [None]:
print("phi:", pose.phi(28))
print("psi:", pose.psi(28))

Analogously, the side-chain torsions can be obtained with the ```chi()``` method, which takes two indexes: one for the chi angle and another for the residue:

In [None]:
print("chi1:", pose.chi(1, 28))
print("chi2:", pose.chi(2, 28))
print("chi3:", pose.chi(3, 28))
print("chi4:", pose.chi(4, 28))
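Phi, psi, and chi are all torsion (dihedral) angles, each defined by four consecutive atoms; phi, for instance, is commonly defined by C(i-1), N(i), CA(i), and C(i). The sketch below computes a dihedral from four Cartesian points using only the standard library. The `dihedral()` helper and the coordinates are hypothetical and are not PyRosetta calls; the arctangent form used here is one common way to obtain a signed angle.

```python
import math

def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dihedral(p1, p2, p3, p4):
    """Signed dihedral angle (degrees) defined by four points."""
    b1, b2, b3 = sub(p2, p1), sub(p3, p2), sub(p4, p3)
    n1 = cross(b1, b2)  # normal of the plane through p1, p2, p3
    n2 = cross(b2, b3)  # normal of the plane through p2, p3, p4
    y = math.sqrt(dot(b2, b2)) * dot(b1, n2)
    x = dot(n1, n2)
    return math.degrees(math.atan2(y, x))

# Made-up coordinates: the fourth point is rotated a quarter turn about
# the central p2-p3 bond relative to the first point
print(dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1)))
```

Using `atan2` rather than `acos` preserves the sign of the angle, which is needed to distinguish, for example, +60 from -60 degrees.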

To find out the length of specific bonds in the structure, we need the pose's conformation object, which is returned by the ```conformation()``` method. This object can be used to query the length of a specific bond (```bond_length()``` method):

In [None]:
conformation = pose.conformation()

In [None]:
conformation.bond_length?

Let's first select a residue and some atoms inside it:

In [None]:
resid = 28
res_28 = pose.residue(resid)
N28 = AtomID(res_28.atom_index("N"), resid)
CA28 = AtomID(res_28.atom_index("CA"), resid)
C28 = AtomID(res_28.atom_index("C"), resid)

The ```AtomID()``` class keeps track of the atom index and residue index of a specific atom. Other functions inside PyRosetta use this object to reference the atom:

In [None]:
print(N28)

Now let's use the ```bond_length()``` method to calculate some bond lengths:

In [None]:
print(pose.conformation().bond_length(N28, CA28))
print(pose.conformation().bond_length(CA28, C28))

This bond-length calculation is equivalent to taking the distance between the two coordinate (position) vectors. We can get the coordinates of each atom by calling the ```xyz()``` method of the ```Residue()``` object:

In [None]:
# Get atom's coordinates
N_xyz = res_28.xyz("N")
CA_xyz = res_28.xyz("CA")
C_xyz = res_28.xyz("C")

# Get the difference vectors
N_CA_vector = CA_xyz - N_xyz
CA_C_vector = CA_xyz - C_xyz

# Calculate the norm (length) of each difference vector
print(N_CA_vector.norm())
print(CA_C_vector.norm())

Note that each set of coordinates is an instance of a special ```Vector()``` class that provides methods to facilitate vector operations.

We can repeat the above process to calculate the angle between three connected atoms:

In [None]:
angle = pose.conformation().bond_angle(N28, CA28, C28)
print(angle)

This angle is in radians; we can convert it to degrees by using the number $\pi$:

In [None]:
import math
print(math.pi)

In [None]:
print(angle*180/math.pi)

The formula to estimate the angle between two vectors A and B is:
    
$\cos(\theta)=\frac{A\cdot B}{|A||B|}$

Can you use the formula above to check the previous result?
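As a starting point, here is the formula applied to made-up coordinates rather than to the pose; the `angle_between()` helper and the three points are hypothetical. Both vectors are taken to start at the central atom, so the result is the angle at that atom.

```python
import math

def angle_between(a, b):
    """Angle in degrees between vectors a and b, from cos(theta) = A.B / (|A||B|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    # Clamp to [-1, 1] to guard against rounding errors before acos
    cos_theta = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    return math.degrees(math.acos(cos_theta))

# Made-up positions for three bonded atoms (not taken from the pose)
n_xyz = (1.0, 0.0, 0.0)
ca_xyz = (0.0, 0.0, 0.0)
c_xyz = (0.0, 1.5, 0.0)

# Vectors pointing from the central atom (CA) to each neighbor
ca_n = tuple(p - q for p, q in zip(n_xyz, ca_xyz))
ca_c = tuple(p - q for p, q in zip(c_xyz, ca_xyz))

print(angle_between(ca_n, ca_c))
```

`math.degrees()` performs the same radian-to-degree conversion as multiplying by 180 and dividing by `math.pi`.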

### Link PyMOL to PyRosetta

First, we need to configure the PyMOL molecular visualization program. Start by finding where the ```PyMOLRosettaServer.py``` script is located in your Conda installation of PyRosetta. Go to your Conda installation directory and execute:


```
find . -name PyMOLRosettaServer.py
```

This will find the script's location by matching its name against all the files under the current directory. Now copy the path to the ```PyMOLRosettaServer.py``` script.
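If `find` is not available (for example on Windows), the same search can be sketched in Python with `pathlib`. The `find_script()` helper is hypothetical, and the search root is an assumption: replace `.` with your Conda installation directory.

```python
from pathlib import Path

def find_script(root, name="PyMOLRosettaServer.py"):
    """Return all paths under root whose file name matches name."""
    return sorted(Path(root).rglob(name))

# Search from the current directory downwards, like `find . -name ...`
for path in find_script("."):
    print(path.resolve())
```

Printing the resolved path gives an absolute location that can be pasted directly into the configuration file.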

Next, go to your home directory and modify the file ```.pymolrc``` by adding the following line:

```
run path_to_the_PyMOLRosettaServer.py_script
```

where ```path_to_the_PyMOLRosettaServer.py_script``` is the previously copied path.

Save the file and start PyMOL from a terminal. If everything is correct, you should see the message ```PyMOL <---> PyRosetta link started!``` in the terminal or in the PyMOL command line window.

#### Load a ```Pose()``` into PyMol

We will use an instance of the ```PyMOLMover()``` class. 

In [None]:
pymol_mover = PyMOLMover()
pymol_mover.keep_history(True) # Keep every frame sent to PyMOL instead of only the last one

The ```PyMOLMover()``` class can load a Pose directly into PyMol's visualization interface. We load our previous Pose() object by calling the method ```apply()``` from the ```PyMOLMover()``` instance:

In [None]:
pymol_mover.apply(pose)

We can check hydrogen bond patterns directly in PyMOL with the following command:


In [None]:
pymol_mover.send_hbonds(pose)

Before continuing, let's restart PyMOL by closing it and opening a new instance.

### Modifying the Pose's geometry

Now that we know how to access basic geometrical information in a ```Pose``` object, we move on to manipulating its geometry.

We start by creating a ```Pose()``` object directly from an amino acid sequence using the function ```pose_from_sequence()```:

In [None]:
# Create a tripeptide
tripeptide = pose_from_sequence("AAA")

Let's print the phi and psi angles and coordinates of the 'CB' carbon of the center residue (2) of this newly created ```Pose()```.

In [None]:
orig_phi = tripeptide.phi(2)
orig_psi = tripeptide.psi(2)
print("original phi:", orig_phi)
print("original psi:", orig_psi)

print("xyz coordinates:", tripeptide.residue(2).xyz("CB"))

We see that the phi and psi angles are set to 180º when a pose is created from an amino acid sequence. 

Let's now display it in PyMOL:

In [None]:
pymol_mover.apply(tripeptide)

We can set specific angles to arbitrary values by using the ```set_phi()``` method inside the ```Pose()``` object:

In [None]:
# Set the phi angle to 90 degrees
tripeptide.set_phi(2, 90)

# Print the phi and psi values after the change
new_phi = tripeptide.phi(2)
new_psi = tripeptide.psi(2)
print("new phi:", new_phi)
print("new psi:", new_psi)

print("xyz coordinates:", tripeptide.residue(2).xyz("CB"))

We now load this into PyMOL:

In [None]:
pymol_mover.apply(tripeptide)

We repeat the same with the psi torsion angle by using the ```set_psi()``` method of the ```Pose()``` object:

In [None]:
# Set the psi angle to 90 degrees
tripeptide.set_psi(2, 90)

# Load into Pymol
pymol_mover.apply(tripeptide)

Let's use:

```
File -> Reinitialize -> Everything
```

to reset Pymol.

Now let's use a loop to sweep the phi angle of residue 2 through every integer value from 0 to 359 degrees:

In [None]:
import time

In [None]:
# Iterate from 0 to 359
for i in range(0, 360, 1):
    
    # Set the phi angle to that number
    tripeptide.set_phi(2, i)
    
    # Send pose to Pymol at each iteration
    pymol_mover.apply(tripeptide)
    
    time.sleep(0.001) # Delay each send to Pymol

Reinitialize PyMOL and now sweep the psi angle:

In [None]:
for i in range(0, 360, 1):
    tripeptide.set_psi(2, i)
    pymol_mover.apply(tripeptide)
    time.sleep(0.001)

### Creating a random perturbation mover

We are now going to create a mover that randomly perturbs the structure's phi and psi angles by a defined magnitude. We first create the ```tripeptide``` ```Pose()``` again:

In [None]:
tripeptide = pose_from_sequence("AAA")

Now we import numpy and define the random perturbation function:

In [None]:
import numpy as np

In [None]:
def perturb_random_angle(pose, max_rot=6):
    
    # Define the perturbation magnitude
    magnitude = np.random.uniform(low=-max_rot, high=max_rot)
    
    # Choose a random angle to perturb between phi and psi
    angle = np.random.choice(['phi', 'psi'])
    
    # Choose a random residue to perturb
    residues = range(1, pose.total_residue() + 1)
    residue = int(np.random.choice(residues))  # plain int for the Pose methods
    
    # Perturb the selected angle by the defined magnitude
    if angle == 'phi':
        orig_phi = pose.phi(residue)
        pose.set_phi(residue, orig_phi+magnitude)
        
    elif angle == 'psi':
        orig_psi = pose.psi(residue)
        pose.set_psi(residue, orig_psi+magnitude)

Our function selects a random residue and a random angle (phi or psi) to be perturbed in the pose. The magnitude of the perturbation is drawn uniformly from the interval between ```-max_rot``` and ```max_rot```. The perturbation is added to the current phi or psi value of the selected residue; this ensures that the function makes a true perturbation rather than setting the torsion to the perturbation value itself.
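The additive update can be tried in isolation on plain numbers. This sketch uses Python's `random` module instead of NumPy and adds one step that the function above does not perform: it wraps the result back into the [-180, 180) range, an optional convenience assumed here for readability. The `wrap_angle()` and `perturb_angle()` helpers are hypothetical.

```python
import random

def wrap_angle(angle):
    """Map any angle in degrees onto the interval [-180, 180)."""
    return (angle + 180.0) % 360.0 - 180.0

def perturb_angle(angle, max_rot=6.0, rng=random):
    """Add a uniform random offset in [-max_rot, max_rot] to an angle."""
    magnitude = rng.uniform(-max_rot, max_rot)
    return wrap_angle(angle + magnitude)

rng = random.Random(0)  # fixed seed so the run is reproducible
angle = 178.0
for step in range(5):
    angle = perturb_angle(angle, max_rot=6.0, rng=rng)
    print(step, round(angle, 2))
```

Starting near 180 degrees makes the wrap-around visible: a small positive perturbation can move the value to the negative end of the range.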

Let's now apply this mover 1000 times and send each step to PyMOL:

In [None]:
pymol_mover.apply(tripeptide)
for i in range(1000):
    perturb_random_angle(tripeptide, max_rot=6)
    pymol_mover.apply(tripeptide)
    time.sleep(0.01)

### Creating a Ramachandran Plot

The Ramachandran plot is a scatter plot of the phi and psi angle values of all the protein's residues. We start by importing the plotting functions from the ```matplotlib``` library:

In [None]:
import matplotlib.pyplot as plt

Now we iterate over all the protein's residues and store the phi and psi angle values in lists:

In [None]:
# Create a list of all residue indexes
residues = range(1, pose_clean.total_residue() + 1)

# Define two empty lists to store the protein's phi and psi values
phi_values = []
psi_values = []

# Iterate over each residue and get its torsional values
for i in residues:
    phi_values.append(pose_clean.phi(i))
    psi_values.append(pose_clean.psi(i))

Now that we have the phi and psi values, we create a scatter plot of them:

In [None]:
# Define resolution and figure size
plt.figure(dpi=100, figsize=(4,4))

# Plot the phi and psi values as a scatter plot
plt.scatter(phi_values, psi_values, c='k', s=5)

# Generate labels for each axis
plt.xlabel(r'$\phi$')
plt.ylabel(r'$\psi$')

# Define the plot x and y limits
plt.xlim(-180,180)
plt.ylim(-180,180)

# Set a title
plt.title('My first Ramachandran plot')

We could redo the plot using the secondary structure of each residue. First we calculate the secondary structure type of each residue, and then we plot the Ramachandran points of each secondary structure type separately. Let's import a class to calculate the secondary structure of the ```Pose()```:

In [None]:
from pyrosetta.rosetta.protocols.moves import DsspMover

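We create an instance of the ```DsspMover()``` class and apply it to the clean pose, which populates the pose's secondary structure string, accessible through ```secstruct()```:

In [None]:
# Get secondary structure mover
DSSP = DsspMover()
# Use the clean pose so that the string matches the residues plotted below
DSSP.apply(pose_clean) # populates the pose's secondary structure string
SS = pose_clean.secstruct()
print(SS)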

Finally, we draw the Ramachandran plot grouped by secondary structure type:

In [None]:
# Get residues indexes on the clean pose
residues = range( 1 , pose_clean.total_residue()  + 1 )

# Create dictionary to store phi values
phi_values = {}
phi_values['L'] = []
phi_values['E'] = []
phi_values['H'] = []

# Create dictionary to store psi values
psi_values = {}
psi_values['L'] = []
psi_values['E'] = []
psi_values['H'] = []

# Define colors for each SS type
color = {
    'L' : 'b',
    'E' : 'r',
    'H' : 'g',
}

# Iterate the residues
for i in residues:
    # Get the SS type for the ith residue
    ss = SS[i-1]
    
    # Get the phi value and store it in the dictionary's list
    phi_values[ss].append(pose_clean.phi(i))
    # Get the psi value and store it in the dictionary's list
    psi_values[ss].append(pose_clean.psi(i))
    
# Create a figure
plt.figure(dpi=100, figsize=(5,5))

# For each SS type create a scatter plot with a different color
for ss in phi_values:
    plt.scatter(phi_values[ss], psi_values[ss], c=color[ss], label=ss, s=5)
    
# Generate labels for each axis
plt.xlabel(r'$\phi$')
plt.ylabel(r'$\psi$')

# Define the plot x and y limits
plt.xlim(-180,180)
plt.ylim(-180,180)

# Set a title
plt.title('My second Ramachandran')

# Plot the legends of the SS type
plt.legend()
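The grouping step above assumes that every residue is labelled `L`, `E`, or `H`; any other letter would raise a `KeyError`. A slightly more defensive variant, sketched here on made-up values, uses `collections.defaultdict` so that unexpected labels get their own group. The `group_by_label()` helper, the label string, and the angle values are hypothetical.

```python
from collections import defaultdict

def group_by_label(labels, phi, psi):
    """Group phi and psi values by their one-letter secondary structure label."""
    groups = defaultdict(lambda: {"phi": [], "psi": []})
    for label, phi_i, psi_i in zip(labels, phi, psi):
        groups[label]["phi"].append(phi_i)
        groups[label]["psi"].append(psi_i)
    return dict(groups)

# Made-up labels and torsions for five residues
labels = "LHHEE"
phi = [-75.0, -60.0, -62.0, -120.0, -130.0]
psi = [150.0, -45.0, -41.0, 130.0, 140.0]

groups = group_by_label(labels, phi, psi)
for label, values in groups.items():
    print(label, values["phi"], values["psi"])
```

Each group can then be passed to `plt.scatter()` in a loop, exactly as in the cell above.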

### Wrapping up

In this practice session we learned:

    1) How to work with Jupyter notebooks

    2) How to use PyRosetta to access a protein structure

    3) How to link PyRosetta to PyMOL to visualize your analysis directly

    4) How to manipulate a protein's geometry

    5) How to create a Ramachandran plot
