<!--NOTEBOOK_HEADER-->
*This notebook contains material from [PyRosetta](https://RosettaCommons.github.io/PyRosetta);
content is available [on Github](https://github.com/RosettaCommons/PyRosetta.notebooks.git).*

<!--NAVIGATION-->
< [Side Chain Conformations and Dunbrack Energies](http://nbviewer.jupyter.org/github/RosettaCommons/PyRosetta.notebooks/blob/master/notebooks/06.01-Side-Chain-Conformations-and-Dunbrack-Energies.ipynb) | [Contents](toc.ipynb) | [Index](index.ipynb) | [Protein Design with a Resfile and FastRelax](http://nbviewer.jupyter.org/github/RosettaCommons/PyRosetta.notebooks/blob/master/notebooks/06.03-Design-with-a-resfile-and-relax.ipynb) ><p><a href="https://colab.research.google.com/github/RosettaCommons/PyRosetta.notebooks/blob/master/notebooks/06.02-Packing-design-and-regional-relax.ipynb"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open in Google Colaboratory"></a>

# RosettaCarbohydrates: Trees, Selectors and Movers
Keywords: carbohydrate, glycan, glucose, mannose, sugar, ResidueSelector, Mover

## Overview
Here, we will cover useful `ResidueSelectors` and `Movers` available in the RosettaCarbohydrate framework.  These framework components form the basis for the tools you will use in the next tutorial, Glycan Modeling and Design.

**Make sure you are in the directory with the pdb files:**

`cd google_drive/My\ Drive/student-notebooks/`

## Imports

Before we begin, we must import some specific machinery from Rosetta.  Many of these tools are imported automatically when we do `from pyrosetta import *`; some, however, are not. You should get into the habit of importing everything you need.  This will get you comfortable with the organization of Rosetta and make it easier to find tools that are beyond the scope of these workshops.

In [None]:
# Notebook setup
import sys
if 'google.colab' in sys.modules:
    !pip install pyrosettacolabsetup
    import pyrosettacolabsetup
    pyrosettacolabsetup.setup()
    print ("Notebook is set for PyRosetta use in Colab.  Have fun!")

In [2]:
#Python
from __future__ import print_function
from pyrosetta import *
from pyrosetta.rosetta import *
from pyrosetta.teaching import *


## Initialization

Here, we will be opening a PDB file with glycans, so we will use `-include_sugars` and a few other options that allow us to read (most) PDB files without issue. It is always a good idea to use the `GlycanInfoMover` to double check that the glycans you are interested in are properly represented by Rosetta.  If they are not, post the issue in the Rosetta forums.

Once again, more information on working with glycans can be found at this page: [Working With Glycans](https://www.rosettacommons.org/docs/latest/application_documentation/carbohydrates/WorkingWithGlycans)

### PDB vs Rosetta sugar format

Unfortunately, there are few standards in the PDB for how saccharide residues in `.pdb` files should be numbered and named. The Rosetta code, given the appropriate initialization flags such as `-alternate_3_letter_codes pdb_sugar`, tries its best to interpret `.pdb` files with sugars, but because of ambiguity and inconsistency, success is in no way ensured.  See http://www.rosettacommons.org/docs/latest/rosetta_basics/preparation/Preparing-PDB-files-for-non-peptide-polymers for more info.


To guarantee that one can model the specific saccharide system desired unambiguously, Rosetta uses a slightly modified `.pdb` format for importing carbohydrate residues. The key difference in formats involves the `HETNAM` record of the PDB format. The standard PDB `HETNAM` record line:</p>

```HETNAM     GLC ALPHA-D-GLUCOSE```

...means that all `GLC` 3-letter codes in the <em>entire file</em> are α-<font style="font-variant: small-caps">d</font>-glucose, which is insufficient, as this 
could mean several different α-<font style="font-variant: small-caps">d</font>-glucoses, depending on the ring form and on the main-chain connectivity of the glycan — and 
many, many more if one includes modified sugars! The modified Rosetta-ready PDB `HETNAM` 
record line:</p>

```HETNAM     Glc A   1  ->4)-alpha-D-Glcp```

...means that the `GLC` residue <em>specifically at position A1</em> requires the `->4)-alpha-D-Glcp` `ResidueType` or any of its `VariantType`s. (Note also that Rosetta uses sentence case 3-letter-codes for sugars.)</p>

Rosetta reads and writes this format by default.
Because we will be working with a structure directly from the PDB, we use `-alternate_3_letter_codes pdb_sugar` to read in PDB-format sugars and `-write_glycan_pdb_codes` to write the PDB format back out.
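To make the difference between the two `HETNAM` styles concrete, here is a minimal sketch that splits a Rosetta-style record into its fields. The helper `parse_rosetta_hetnam` and the `SugarHetnam` tuple are hypothetical names for illustration and are not part of PyRosetta; the sketch also assumes whitespace-separated fields rather than strict PDB columns.

```python
from typing import NamedTuple


class SugarHetnam(NamedTuple):
    """Fields of one Rosetta-style HETNAM record (illustrative names)."""
    code: str          # sentence-case 3-letter code, e.g. "Glc"
    chain: str         # chain ID, e.g. "A"
    resnum: int        # residue number within that chain
    residue_type: str  # Rosetta ResidueType name


def parse_rosetta_hetnam(line: str) -> SugarHetnam:
    # Assumption: fields are separated by whitespace; a strict reader
    # would slice fixed PDB columns instead.
    fields = line.split(None, 4)
    if len(fields) != 5 or fields[0] != "HETNAM":
        raise ValueError("not a Rosetta-style HETNAM record: %r" % line)
    _, code, chain, resnum, residue_type = fields
    return SugarHetnam(code, chain, int(resnum), residue_type.strip())


record = parse_rosetta_hetnam("HETNAM     Glc A   1  ->4)-alpha-D-Glcp")
print(record)
```

A standard PDB record such as `HETNAM     GLC ALPHA-D-GLUCOSE` carries no chain or residue number, so this sketch rejects it with a `ValueError`.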





In [3]:
options = """
-ignore_unrecognized_res
-include_sugars
-auto_detect_glycan_connections
-maintain_links 
-alternate_3_letter_codes pdb_sugar
-write_glycan_pdb_codes
-ignore_zero_occupancy false 
-load_PDB_components false
-no_fconfig
"""

In [4]:
init(" ".join(options.split('\n')))

PyRosetta-4 2019 [Rosetta PyRosetta4.Release.python36.mac 2019.39+release.93456a567a8125cafdf7f8cb44400bc20b570d81 2019-09-26T14:24:44] retrieved from: http://www.pyrosetta.org
(C) Copyright Rosetta Commons Member Institutions. Created in JHU by Sergey Lyskov and PyRosetta Team.
core.init: Rosetta version: PyRosetta4.Release.python36.mac r233 2019.39+release.93456a567a8 93456a567a8125cafdf7f8cb44400bc20b570d81 http://www.pyrosetta.org 2019-09-26T14:24:44
core.init: command: PyRosetta -ignore_unrecognized_res -include_sugars -auto_detect_glycan_connections -maintain_links -alternate_3_letter_codes pdb_sugar -write_glycan_pdb_codes -ignore_zero_occupancy false -load_PDB_components false -no_fconfig -database /Users/jadolfbr/Library/Python/3.6/lib/python/site-packages/pyrosetta-2019.39+release.93456a567a8-py3.6-macosx-10.6-intel.egg/pyrosetta/database
basic.random.init_random_generator: 'RNG device' seed mode, using '/dev/urandom', seed=-535540840 seed_offset=0 rea

In [5]:
pose = pose_from_pdb("inputs/glycans/4do4_refined.pdb")


core.chemical.GlobalResidueTypeSet: Finished initializing fa_standard residue type set.  Created 1251 residue types
core.chemical.GlobalResidueTypeSet: Total time to initialize 1.24028 seconds.
core.import_pose.import_pose: File 'inputs/glycans/4do4_refined.pdb' automatically determined to be of type PDB
core.io.util: Automatic glycan connection is activated.
core.io.util: Start reordering residues.
core.io.util: Corrected glycan residue order (internal numbering): [388, 389, 390, 391, 392, 393, 394, 395, 396, 797, 798, 799, 800, 801, 802, 803, 804, 805]
core.io.util:
core.io.pose_from_sfr.PoseFromSFRBuilder: Setting chain termination for 390
core.io.pose_from_sfr.PoseFromSFRBuilder: Setting chain termination for 394
core.io.pose_from_sfr.PoseFromSFRBuilder: Setting chain termination for 395
core.io.pose_from_sfr.PoseFromSFRBuilder: Setting chain termination for 396
core.io.pose_from_sfr.PoseFr

core.conformation.Conformation: current variant for 110 CYD
core.conformation.Conformation: current variant for 141 CYD
core.conformation.Conformation: Found disulfide between residues 506 537
core.conformation.Conformation: current variant for 506 CYS
core.conformation.Conformation: current variant for 537 CYS
core.conformation.Conformation: current variant for 506 CYD
core.conformation.Conformation: current variant for 537 CYD
core.conformation.Conformation: Found disulfide between residues 170 192
core.conformation.Conformation: current variant for 170 CYS
core.conformation.Conformation: current variant for 192 CYS
core.conformation.Conformation: current variant for 170 CYD
core.conformation.Conformation: current variant for 192 CYD
core.conformation.Conformation: Found disulfide between residues 566 588
core.conformation.Conformation: current variant for 566 CYS
core.

## Object Exploration: GlycanTreeSet, CarbohydrateInfo, and the GlycanInfoMover

Before we do anything else, let's get some information on the pose that we are working with.

### GlycanTreeSet

The `GlycanTreeSet` is created when glycans are added to a pose or when a pose is created with glycans in it.  The `GlycanTreeSet` has information on each glycan tree and each residue's parent and children.  The tree set also has an observer attached to it, so it updates itself automatically when glycan residues are attached to or removed from the pose.  The `GlycanTreeSet` is part of the Pose's `Conformation` object.  First, let's explore this.

Let's find out how many glycan trees there are and how long they are.

In [6]:
tree_set = pose.glycan_tree_set()

In [7]:
print(tree_set.n_trees())

6


OK, so there are 6 glycan trees in our pose!  Cool.  Let's see how long the largest one is:

In [9]:
print(tree_set.get_largest_glycan_tree_length())

5


#### GlycanTree and GlycanNode

The `GlycanTreeSet` is made up of `GlycanTree` objects.  Each of these is made up of one `GlycanNode` for each residue in the tree. Let's explore these.

In [11]:
for start in tree_set.get_start_points():
    print(start, pose.pdb_info().pose2pdb(start), pose.residue_type(start).name3(), pose.residue_type(start).name())

388 501 A  Glc ->4)-beta-D-Glcp:2-AcNH
391 504 A  Glc ->4)-beta-D-Glcp:2-AcNH
396 509 A  Glc ->3)-beta-D-Glcp:non-reducing_end:2-AcNH
797 501 B  Glc ->4)-beta-D-Glcp:->6)-branch:2-AcNH
800 504 B  Glc ->4)-beta-D-Glcp:2-AcNH
805 509 B  Glc ->3)-beta-D-Glcp:non-reducing_end:2-AcNH


Let's look at the parent of each of these glycan start points to see if they are connected to a protein, and if so, what residue they are attached to.

In [12]:
for start in tree_set.get_start_points():
    parent = tree_set.get_parent(start)
    parent_name = "NONE"
    if parent != 0:
        parent_name = pose.residue_type(parent).name3()
    print(parent, pose.pdb_info().pose2pdb(parent), parent_name)

107 124 A  ASN
160 177 A  ASN
368 385 A  ASN
503 124 B  ASN
556 177 B  ASN
764 385 B  ASN


Cool.  So they are all connected to protein residues at an asparagine.  Let's take a look at the tree that starts at the first sugar.

In [13]:
tree1 = tree_set.get_tree(388)

In [15]:
print("length", tree1.size())
print("root", tree1.get_root())

length 3
root 107


In [16]:
for res in tree1.get_residues():
    print(res, pose.residue_type(res).name3(), pose.residue_type(res).name())

388 Glc ->4)-beta-D-Glcp:2-AcNH
389 Glc ->4)-beta-D-Glcp:2-AcNH
390 Man ->3)-beta-D-Manp:non-reducing_end


Let's take a closer look at that mannose at the end of the tree.

In [19]:
node390 = tree1.get_node(390)

In [23]:
print("n_children", len(node390.get_children()))
print("parent", node390.get_parent())
print("distance", node390.get_distance_to_start())
print("exocyclic_connection", node390.has_exocyclic_linkage())

n_children 0
parent 389
distance 2
exocyclic_connection False


### CarbohydrateInfo

Let's get a bit more information on this particular glycan residue.

In [24]:
info390 = pose.residue_type(390).carbohydrate_info()

In [25]:
info390.anomeric_carbon()

1

In [38]:
info390.anomeric_carbon_name()

'C1'

In [28]:
info390.basic_name()

'mannose'

In [29]:
info390.cyclic_oxygen()

5

In [30]:
info390.cyclic_oxygen_name()

' O5 '

In [31]:
info390.full_name()

'beta-D-mannopyranosyl'

In [32]:
info390.has_exocyclic_linkage_to_child_mainchain()

False

In [33]:
info390.is_alpha_sugar()

False

In [34]:
info390.is_amino_sugar()

False

In [35]:
info390.is_beta_sugar()

True

In [36]:
info390.is_cyclic()

True

In [37]:
info390.is_acetylated()

False

As you can see, the `CarbohydrateInfo` object of a `ResidueType` provides a great deal of information on this particular sugar.  By using the `GlycanTreeSet` and `CarbohydrateInfo` objects, one can work out nearly everything about a particular tree or glycan, and how they are connected to each other and to the rest of the pose.
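Rather than calling each accessor in its own cell, the results can be gathered into one dictionary. This is a minimal sketch: `INFO_GETTERS` and `carbohydrate_summary` are hypothetical names, and the list contains only zero-argument accessors that were called in the cells above.

```python
# Zero-argument CarbohydrateInfo accessors demonstrated in this notebook.
INFO_GETTERS = (
    "basic_name", "full_name", "anomeric_carbon", "cyclic_oxygen",
    "is_alpha_sugar", "is_beta_sugar", "is_cyclic",
    "is_amino_sugar", "is_acetylated",
)


def carbohydrate_summary(info):
    """Call each accessor on a CarbohydrateInfo-like object and collect the results."""
    return {name: getattr(info, name)() for name in INFO_GETTERS}
```

For the mannose above, `carbohydrate_summary(pose.residue_type(390).carbohydrate_info())` would bundle the values that the individual cells printed.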

## GlycanInfoMover

This mover essentially prints much of the connectivity information of a particular pose.  It is useful as a first-pass to get general info and to make sure that Rosetta is loading your glycan properly.

Note: You will need to look at the terminal for output of this mover.

In [39]:
from pyrosetta.rosetta.protocols.analysis import *


In [40]:
glycan_info = GlycanInfoMover()
glycan_info.apply(pose)

(Output copied below)

```
Branch Point: ASN 107 124 A 
Branch Point: ASN 160 177 A 
Branch Point: ASN 368 385 A 
Carbohydrate: 388 501 A  Parent: 107 BP: 0 501 A   CON: _->4       DIS: 0 ShortName: ->4)-beta-D-GlcpNAc-
Carbohydrate: 389 502 A  Parent: 388 BP: 0 502 A   CON: _->4       DIS: 1 ShortName: ->4)-beta-D-GlcpNAc-
Carbohydrate: 390 503 A  Parent: 389 BP: 0 503 A   CON:            DIS: 2 ShortName: beta-D-Manp-
Carbohydrate: 391 504 A  Parent: 160 BP: 0 504 A   CON: _->4       DIS: 0 ShortName: ->4)-beta-D-GlcpNAc-
Carbohydrate: 392 505 A  Parent: 391 BP: 0 505 A   CON: _->4       DIS: 1 ShortName: ->4)-beta-D-GlcpNAc-
Carbohydrate: 393 506 A  Parent: 392 BP: 1 506 A   CON: _->3,_->6  DIS: 2 ShortName: ->3)-beta-D-Manp-
Carbohydrate: 394 507 A  Parent: 393 BP: 0 507 A   CON:            DIS: 3 ShortName: alpha-D-Manp-
Carbohydrate: 395 508 A  Parent: 393 BP: 0 508 A   CON:            DIS: 3 ShortName: alpha-D-Manp-
Carbohydrate: 396 509 A  Parent: 368 BP: 0 509 A   CON:            DIS: 0 ShortName: beta-D-GlcpNAc-
Branch Point: ASN 503 124 B 
Branch Point: ASN 556 177 B 
Branch Point: ASN 764 385 B 
Carbohydrate: 797 501 B  Parent: 503 BP: 1 501 B   CON: _->4,_->6  DIS: 0 ShortName: ->4)-beta-D-GlcpNAc-
Carbohydrate: 798 502 B  Parent: 797 BP: 0 502 B   CON:            DIS: 1 ShortName: beta-D-GlcpNAc-
Carbohydrate: 799 503 B  Parent: 797 BP: 0 503 B   CON:            DIS: 1 ShortName: alpha-L-Fucp-
Carbohydrate: 800 504 B  Parent: 556 BP: 0 504 B   CON: _->4       DIS: 0 ShortName: ->4)-beta-D-GlcpNAc-
Carbohydrate: 801 505 B  Parent: 800 BP: 0 505 B   CON: _->4       DIS: 1 ShortName: ->4)-beta-D-GlcpNAc-
Carbohydrate: 802 506 B  Parent: 801 BP: 1 506 B   CON: _->3,_->6  DIS: 2 ShortName: ->3)-beta-D-Manp-
Carbohydrate: 803 507 B  Parent: 802 BP: 0 507 B   CON:            DIS: 3 ShortName: alpha-D-Manp-
Carbohydrate: 804 508 B  Parent: 802 BP: 0 508 B   CON:            DIS: 3 ShortName: alpha-D-Manp-
Carbohydrate: 805 509 B  Parent: 764 BP: 0 509 B   CON:            DIS: 0 ShortName: beta-D-GlcpNAc-
Glycan Residues: 18
Protein BPs: 6
TREES
107 124 A  Length: 3
160 177 A  Length: 5
368 385 A  Length: 1
503 124 B  Length: 3
556 177 B  Length: 5
764 385 B  Length: 1
```
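Because this mover writes to the terminal, it can be handy to parse the captured text back into Python. The sketch below is one way to do that with a regular expression; `LINE_RE`, `parse_glycan_info`, and `children_of` are hypothetical names, and the pattern assumes the line layout shown in the output above. The `BP` column is treated as a branch-point flag, which is an interpretation of the output rather than documented behavior.

```python
import re
from collections import defaultdict

# Matches the "Carbohydrate:" lines in the layout printed above.
LINE_RE = re.compile(
    r"Carbohydrate:\s+(?P<res>\d+)\s+(?P<pdb_num>\d+)\s+(?P<chain>\S+)\s+"
    r"Parent:\s+(?P<parent>\d+)\s+BP:\s+(?P<bp>\d+).*?"
    r"DIS:\s+(?P<dis>\d+)\s+ShortName:\s+(?P<name>\S+)"
)


def parse_glycan_info(text):
    """Map pose residue number -> fields parsed from each 'Carbohydrate:' line."""
    records = {}
    for line in text.splitlines():
        match = LINE_RE.search(line)
        if match:
            records[int(match["res"])] = {
                "parent": int(match["parent"]),
                "is_branch_point": match["bp"] == "1",  # assumed meaning of BP
                "distance": int(match["dis"]),
                "short_name": match["name"],
            }
    return records


def children_of(records):
    """Invert the parent links: parent residue -> sorted list of children."""
    children = defaultdict(list)
    for res, fields in sorted(records.items()):
        children[fields["parent"]].append(res)
    return dict(children)


sample = """\
Carbohydrate: 800 504 B  Parent: 556 BP: 0 504 B   CON: _->4       DIS: 0 ShortName: ->4)-beta-D-GlcpNAc-
Carbohydrate: 801 505 B  Parent: 800 BP: 0 505 B   CON: _->4       DIS: 1 ShortName: ->4)-beta-D-GlcpNAc-
Carbohydrate: 802 506 B  Parent: 801 BP: 1 506 B   CON: _->3,_->6  DIS: 2 ShortName: ->3)-beta-D-Manp-
Carbohydrate: 803 507 B  Parent: 802 BP: 0 507 B   CON:            DIS: 3 ShortName: alpha-D-Manp-
Carbohydrate: 804 508 B  Parent: 802 BP: 0 508 B   CON:            DIS: 3 ShortName: alpha-D-Manp-
"""

records = parse_glycan_info(sample)
print(children_of(records))
```

The inverted map makes the branching visible: residue 802 is the only parent in this sample with more than one child.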

### Branched Connections

Now we can see all of the glycans in the pose, their parents, and how they are connected to one another. Note residue 802: it has two connections, at carbons 3 and 6.  This means we have a branched connection and that residue 802 has two children.  Here the branch is at carbon 6, which is an exocyclic connection.  An exocyclic linkage has 3 backbone dihedrals instead of the standard two.  Let's confirm all of that.

In [50]:
# Python port of the C++ code that GlycanInfoMover uses to report connections:
def get_connections(localpose, resnum):
    info = localpose.residue(resnum).carbohydrate_info()
    attach = "_->"
    connections = []

    # Main-chain acceptor position; a value of 0 is treated as "none".
    if info.mainchain_glycosidic_bond_acceptor():
        connections.append(attach + str(info.mainchain_glycosidic_bond_acceptor()))

    # Branch points are indexed from 1.
    for i in range(1, info.n_branches() + 1):
        connections.append(attach + str(info.branch_point(i)))

    # Joining a list avoids a leading comma when there is no main-chain acceptor.
    return ",".join(connections)

get_connections(pose, 802)

'_->3,_->6'

In [52]:
tree802 = tree_set.get_tree_containing_residue(802)
node802 = tree_set.get_node(802)

In [53]:
print("len", tree802.size())
print("children", node802.get_children())
print("exocyclic", node802.has_exocyclic_linkage())

len 5
children vector1_unsigned_long[803, 804]
exocyclic False


Note that 802 doesn't have an exocyclic linkage back to its parent; however, one of its children has an exocyclic connection back to it.  Let's find out which one.

In [54]:
print("exo_803", tree802.get_node(803).has_exocyclic_linkage())
print("exo_804", tree802.get_node(804).has_exocyclic_linkage())

exo_803 False
exo_804 True


Cool.  So residue 804 is the branched connection. Let's take a closer look.

In [56]:
node804 = tree802.get_node(804)
node803 = tree802.get_node(803)

In [58]:
node802.get_mainchain_child()

803

### MoveMapFactory vs MoveMap creation

Here is something important to note.  Rosetta has a concept of the 'mainchain' because it was primarily written for proteins, which are linear in nature.  Deep within Rosetta, even sugars are denoted as having a 'mainchain', which is made up of the 'non-branched' connections.  In this case, the mainchain continues on to residue 803, while the 'branch' goes off to residue 804.  This is __EXTREMELY__ important to be aware of, because MoveMaps have separate switches for 'branched' torsions.  You should therefore always create glycan MoveMaps with the `MoveMapFactory`, which handles all of this automatically; otherwise, branched torsions will not be turned on!
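The mainchain/branch distinction can be made explicit with the two node accessors used earlier, `get_children()` and `get_mainchain_child()`. The helper below is a minimal sketch; `split_children` is a hypothetical name and not part of PyRosetta.

```python
def split_children(node):
    """Separate a GlycanNode's children into the mainchain child and the branches."""
    mainchain = node.get_mainchain_child()
    # Every child that is not the mainchain child hangs off a branch connection.
    branches = [child for child in node.get_children() if child != mainchain]
    return mainchain, branches
```

For `node802`, the values printed above (children 803 and 804, mainchain child 803) leave 804 as the only branch child.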

After that side note, let's confirm that there are indeed 3 torsions for the branched connection between residues 802 and 804. Remember that torsions are defined from child TO parent!

In [61]:
from pyrosetta.rosetta.core.pose.carbohydrates import *
from pyrosetta.rosetta.core.conformation.carbohydrates import *

In [63]:
get_n_glycosidic_torsions_in_res(pose.conformation(), 804)

3

Great.  We have 3. Let's make sure our mainchain child has two.

In [64]:
get_n_glycosidic_torsions_in_res(pose.conformation(), 803)

2

Awesome.  Finally, let's see how many torsions there are between the first glycan residue of a tree and its ASN (the cell below uses `tree1`).  Note that ASN has 3 'chi' angles before glycosylation.

In [68]:
get_n_glycosidic_torsions_in_res(pose.conformation(), tree1.get_start())

4

After glycosylation, this ASN no longer has side-chain chi angles to pack.  In the packer, they are turned off, as they are now part of the glycan backbone.  How is this done?  Let's see.

In [None]:
# Print the ResidueType of the ASN that tree802 is attached to.
print(pose.residue_type(tree802.get_root()))

In [70]:
protein_res = tree802.get_node(tree802.get_start()).get_parent()
print(protein_res, pose.residue_type(protein_res).name3())

556 ASN


In [71]:
print("Is Branch Point:", pose.residue(protein_res).is_branch_point())

Is Branch Point: True


OK, now we can see that this residue is a branch point, meaning that it once again has a mainchain connection that goes on to the next protein residue, and a branch out to the start of the glycan.  Take a look at the rest of the glycan residues.  Which are the branch points?  Does this info match what the `GlycanInfoMover` printed?
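One way to approach this exercise is to loop over a list of residue numbers and keep those that report themselves as branch points. The helper below is a minimal sketch built on the `pose.residue(i).is_branch_point()` call from the cell above; `find_branch_points` is a hypothetical name.

```python
def find_branch_points(pose, residues):
    """Return the residue numbers in `residues` that are branch points."""
    return [res for res in residues if pose.residue(res).is_branch_point()]
```

Passing the residues of a tree, for example `find_branch_points(pose, tree802.get_residues())`, gives a list you can compare against the `BP` column in the mover output.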

## Glycan Residue Selectors

Now that we have a good idea about the glycans in our pose, let's use some residue selectors that build on the underlying tools we just learned about.

### GlycanResidueSelector

The most basic, but useful selector is the `GlycanResidueSelector`.  Here is the description:
```
A ResidueSelector for carbohydrates and individual carbohydrate trees.
  Selects all Glycan residues if no option is given or the branch going out from the root residue. 
  Selecting from root residues allows you to choose the whole glycan branch or only tips, etc.
```

First, let's select all carbohydrate residues in the pose.

In [74]:
from pyrosetta.rosetta.core.select.residue_selector import *
glycan_selector = GlycanResidueSelector()
all_glycans = glycan_selector.apply(pose)
for i in range(1, pose.size()+1):
    if all_glycans[i]:
        print(i, pose.residue_type(i).name3())

388 Glc
389 Glc
390 Man
391 Glc
392 Glc
393 Man
394 Man
395 Man
396 Glc
797 Glc
798 Glc
799 Fuc
800 Glc
801 Glc
802 Man
803 Man
804 Man
805 Glc
