# AMPAL and structural analysis

Biomolecules in ISAMBARD are represented using the AMPAL (Atom, Monomer, Polymer, Assembly, Ligand) framework. This is a formal representation of biomolecules in a hierarchical structure of lightweight Python objects that enable you to navigate through the protein structure from the atomic level to the assembly level and vice versa. The image below shows the flow from the `Atom` to the `Assembly` level.
![AMPAL_basic](imgs/AMPAL_basic.png)

This tutorial demonstrates how AMPAL objects work, and introduces tools built into these objects for structural analysis and validation.

## Importing a structure into the AMPAL framework
First, import `isambard` into the Python environment, then load one of the structure files provided, `3UEJ.pdb`. We'll use `nglview` to view the protein along the way, so you have a visual check of what you're working with.

In [None]:
import isambard
import nglview as nv
from pprint import pprint

In [None]:
my_pdb = isambard.ampal.convert_pdb_to_ampal("pdbs/3UEJ.pdb")

Have a look at what you've got:

In [None]:
my_pdb

The `.pdb` attribute lets you access the PDB formatted structure as a string. We can view this with NGLView by defining two simple functions:

In [None]:
def show_ball_and_stick(ampal):
    view = nv.show_text(ampal.pdb)
    view.add_ball_and_stick()
    view.remove_cartoon()
    return view

In [None]:
def show_cartoon(ampal):
    view = nv.show_text(ampal.pdb)
    return view

In [None]:
show_ball_and_stick(my_pdb)

This structure contains two `Polypeptide` chains and 230 `Ligand` objects (water, zinc and phosphate). We'll come back to the `Ligand` objects later; let's focus on the `Polypeptide` chains for now. Individual `Polypeptide`s are accessed by list index:

In [None]:
my_pdb[0] # the first polypeptide in the assembly

... or by using a chain identifier as a string:

In [None]:
my_polypeptide = my_pdb['A']

The chain identifier can be accessed via the `.id` attribute:

In [None]:
my_polypeptide.id

In [None]:
show_ball_and_stick(my_polypeptide)

## Navigating the AMPAL hierarchy
You can get back to the `Assembly` object via the `.ampal_parent` attribute

In [None]:
my_polypeptide.ampal_parent

You can get the individual residues via the `.get_monomers()` method. This returns a Python iterator object, but if you're not comfortable using these you can convert it straight to a list.

In [None]:
my_residues = list(my_polypeptide.get_monomers())

Individual residues can be accessed from this list via index

In [None]:
my_residues[0]

Alternatively, you can get a residue via its index or via PDB number directly from the polypeptide object:

In [None]:
my_polypeptide[0] # The first residue in the polypeptide

In [None]:
my_polypeptide['222'] # The residue numbered 222 in the PDB file (also, the first residue!)

In [None]:
my_residue = my_polypeptide['222']

You can find more information about the residue using the `.mol_code`, `.mol_letter` and `.id` attributes:

In [None]:
my_residue.mol_code

In [None]:
my_residue.mol_letter

In [None]:
my_residue.id

You can get an ordered dictionary of atoms via the `.get_atoms()` method:

In [None]:
my_residue.get_atoms()

or you can access an atom directly by a dictionary look-up:

In [None]:
my_residue['CA']

and its coordinates via the `.x`, `.y` and `.z` attributes

In [None]:
print (my_residue['CA'].x, my_residue['CA'].y, my_residue['CA'].z)

You can get back to the `Residue`, `Polypeptide` and `Assembly` objects using `.ampal_parent`:

In [None]:
my_atom = my_residue['CA']

In [None]:
my_atom.ampal_parent

In [None]:
my_atom.ampal_parent.ampal_parent

In [None]:
my_atom.ampal_parent.ampal_parent.ampal_parent

And you can go from the `Assembly` level right down to the `Atom` level in one step:

In [None]:
my_pdb['A']['222']['CA']
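
The two-way navigation shown above relies on each object knowing both its children and its parent. The block below is a toy sketch of that idea only: `ToyNode` is a hypothetical class written for illustration, not ISAMBARD's implementation, and it borrows just the `ampal_parent` attribute name from this tutorial.

```python
class ToyNode:
    """Toy stand-in for an AMPAL object: holds named children and a parent link."""

    def __init__(self, node_id, parent=None):
        self.id = node_id
        self.ampal_parent = parent  # same attribute name as used in the tutorial
        self._children = {}
        if parent is not None:
            parent._children[node_id] = self

    def __getitem__(self, key):
        # Integer keys index children by insertion order; other keys look up by id.
        if isinstance(key, int):
            return list(self._children.values())[key]
        return self._children[key]


# Build a four-level hierarchy: assembly -> chain -> residue -> atom.
assembly = ToyNode("3UEJ")
chain = ToyNode("A", parent=assembly)
residue = ToyNode("222", parent=chain)
atom = ToyNode("CA", parent=residue)

# Walk down in one expression, then back up through the parent links.
found = assembly["A"]["222"]["CA"]
print(found is atom)
print(found.ampal_parent.ampal_parent.ampal_parent.id)
```

Storing the parent on the child is what makes it possible to start from any atom and recover its residue, chain and assembly without searching.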

## Selections and tagging

The polypeptide sequence can be accessed via the `.sequence` attribute

In [None]:
my_polypeptide.sequence

You can select a region of structure by two methods:
* via the residue index (from the 0th to the nth residue in the polypeptide)
* via the PDB residue numbering - in this structure the residues start at 222

via residue index:

In [None]:
my_selection = my_polypeptide[0:15]

In [None]:
my_selection

via PDB residue numbering using `.get_slice_from_res_id('start id','end id')`:

In [None]:
my_other_selection = my_polypeptide.get_slice_from_res_id('240','260')

In [None]:
my_other_selection

Let's view these in nglview:

In [None]:
show_ball_and_stick(my_other_selection)
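
The difference between the two selection styles is easy to mix up, so here is a toy sketch using plain lists of residue numbers rather than AMPAL objects. Both helper functions are hypothetical. In particular, the id-based helper is written to include both end points, which is an assumption about how such a slice might behave and worth checking against the API documentation for `.get_slice_from_res_id()`.

```python
# Toy residue numbering: ten residues whose PDB numbers start at 222.
res_ids = [str(n) for n in range(222, 232)]


def slice_by_index(ids, start, stop):
    """Positional slice, half-open like any Python slice."""
    return ids[start:stop]


def slice_by_res_id(ids, start_id, end_id):
    """Slice by residue number; ASSUMED here to include both end points."""
    first = ids.index(start_id)
    last = ids.index(end_id)
    return ids[first:last + 1]


by_index = slice_by_index(res_ids, 0, 3)
by_res_id = slice_by_res_id(res_ids, "224", "226")
print(by_index)
print(by_res_id)
```

Index 0 corresponds to residue number 222 in this toy numbering, which mirrors the offset between list position and PDB numbering in chain A of 3UEJ.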

### Select on the basis of secondary structure
Accessing the `.helices` or `.strands` attribute runs DSSP automatically and assigns secondary structure. Both attributes return the matching elements of secondary structure as `Assembly` objects.

In [None]:
my_helices = my_polypeptide.helices

In [None]:
my_strands = my_polypeptide.strands

In [None]:
my_strands

In [None]:
my_strands[0].sequence

In [None]:
show_cartoon(my_strands)

### Tagging
Each level in the AMPAL hierarchy has a dictionary attached to it, accessed via the `.tags` attribute. When `.helices` or `.strands` is accessed, each `Residue` in the AMPAL object is tagged with its secondary structure. The following code prints the tags of the first residue in the first strand:

In [None]:
my_strands[0][0].tags

Or, the secondary structure tags of all the residues in the selection we made earlier:

In [None]:
pprint ([x.tags['secondary_structure'] for x in my_selection.get_monomers()])

There are several direct methods for tagging:

* `.tag_ca_geometry()` 
* `.tag_secondary_structure()`
* `.tag_sidechain_dihedrals()`
* `.tag_torsion_angles()`
* `.tag_residue_solvent_accessibility()` (requires NACCESS http://wolf.bms.umist.ac.uk/naccess/)

> ### Note
> Don't forget that you can see information on specific functions/classes in a number of ways:
> 1. Check the [API documentation](https://woolfson-group.github.io/isambard/api_reference.html)
> 1. Take a look at the [source code](https://github.com/woolfson-group/isambard/tree/master/isambard)
> 1. Shift+Tab inside the round brackets if you're using Jupyter Notebook
> 1. Use the Python `help` function e.g. `help(isambard.ampal.convert_pdb_to_ampal)`

### A rudimentary Ramachandran plot:

In [None]:
my_polypeptide.tag_torsion_angles()

In [None]:
phi = [x.tags['phi'] for x in my_polypeptide]
psi = [x.tags['psi'] for x in my_polypeptide]    

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
plt.axhline(0, color='black', linewidth=1)
plt.axvline(0, color='black', linewidth=1)
plt.scatter(phi,psi)
plt.xlabel("Phi")
plt.ylabel("Psi")
plt.xlim(-180, 180)
plt.ylim(-180, 180)
None
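
Terminal residues lack one of the two angles, because phi depends on the preceding residue and psi on the following one. The sketch below assumes that an unmeasurable angle is stored as `None`, which is an assumption about the tag values rather than something this tutorial states. The data are made-up numbers, not angles from 3UEJ.

```python
# Toy (phi, psi) pairs; None marks an angle that could not be measured.
toy_angles = [(None, 140.2), (-63.5, -42.1), (-120.0, 130.5), (-70.3, None)]

# Keep only residues where both angles are defined.
plottable = [(phi, psi) for phi, psi in toy_angles
             if phi is not None and psi is not None]

# Split back into the two lists a scatter plot expects.
phi_values = [phi for phi, _ in plottable]
psi_values = [psi for _, psi in plottable]
print(len(toy_angles), "residues,", len(plottable), "plottable")
```

Filtering the pairs together, rather than each list separately, keeps phi and psi aligned residue by residue.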

You can use your own tags:

In [None]:
for x in my_selection:
    x.tags['my_tag'] = 'My Value'

In [None]:
print ([x.tags['my_tag'] for x in my_selection])
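
Because tags are ordinary dictionaries, standard Python tools work on them directly. The sketch below uses plain dicts as stand-ins for each residue's `.tags`; the tag values are placeholders chosen for illustration.

```python
from collections import Counter

# Toy residues: each one is just a dict of tags, standing in for `.tags`.
toy_residue_tags = [
    {"secondary_structure": "E", "my_tag": "My Value"},
    {"secondary_structure": "E"},
    {"secondary_structure": "H"},
    {"secondary_structure": " "},
]

# Tally one tag across all residues.
ss_counts = Counter(tags["secondary_structure"] for tags in toy_residue_tags)

# .get() avoids a KeyError for residues that were never given the custom tag.
custom_values = [tags.get("my_tag") for tags in toy_residue_tags]

print(ss_counts)
print(custom_values)
```

Using `.get()` matters once you read a custom tag over a wider set of residues than the ones you tagged.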

## Dealing with ligands
Ligands can be accessed via the `.ligands` attribute at the `Polypeptide` level, or the `.get_ligands()` method at the `Assembly` level. We'll just work with the `Polypeptide`.

In [None]:
my_ligands = my_polypeptide.ligands

In [None]:
pprint ([x.mol_code for x in my_ligands])

As you can see, most of these are water molecules, but there are two zinc ions, which are the ligands of interest.

In [None]:
my_zinc1 = my_ligands[0]

In [None]:
my_zinc1

We can look at the environment surrounding a zinc ion within a defined distance cutoff:

In [None]:
my_zinc1.close_monomers(my_pdb, cutoff=4.0)

To view these in NGLView, we need to wrap them in a dummy `Assembly` object:

In [None]:
my_zinc_env = my_zinc1.close_monomers(my_pdb, cutoff=4.0)

In [None]:
my_zinc_assembly = isambard.ampal.Assembly()
for x in my_zinc_env:
    my_zinc_assembly.append(isambard.ampal.Polymer(x))

In [None]:
show_ball_and_stick(my_zinc_assembly)
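
To make the idea of a distance cutoff concrete, here is a toy sketch of a neighbourhood search on made-up coordinates. The function `residues_within` is hypothetical, and the rule it applies (a residue counts as close if any one of its atoms lies within the cutoff) is an assumption for illustration, not a description of how `.close_monomers()` is implemented.

```python
import math

# Toy coordinates in angstroms; names and positions are made up.
zinc_xyz = (0.0, 0.0, 0.0)
toy_residues = {
    "CYS 10": [(2.3, 0.0, 0.0), (3.5, 1.0, 0.0)],
    "HIS 25": [(0.0, 2.1, 0.0)],
    "LEU 40": [(6.0, 6.0, 6.0)],
}


def residues_within(centre, residues, cutoff):
    """Return names of residues with at least one atom within `cutoff` of `centre`."""
    close = []
    for name, atoms in residues.items():
        if any(math.dist(centre, atom) <= cutoff for atom in atoms):
            close.append(name)
    return close


print(residues_within(zinc_xyz, toy_residues, cutoff=4.0))
```

Tightening the cutoff shrinks the environment, so the choice of 4.0 in the cells above directly controls which residues are reported.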

## Scaling it up
All of this is useful, but it is nothing you couldn't do in PyMOL or similar with a few clicks. We are now going to work with a much larger set of structures taken from the PDB, which is the kind of analysis that is much harder to do in a GUI-style environment.

### RCSB (`http://rcsb.org`) query:

We queried the RCSB to get a set of X-ray crystal structures of proteins with zinc ligands. This returned 84 structures, which are included as part of the tutorial, along with one NMR structure we added to demonstrate the `AmpalContainer` class. The RCSB PDB query is below if you would like to repeat it.

_`Ligand Search` : Has free ligands=yes and Chemical Name: Name Contains zinc and Polymeric type is Any and Sequence Length is between 40 and 100 and Holdings : Molecule Type=protein Experimental Method=X-RAY and Resolution is 1.499 or less_ 

* returned 84 structures + one added NMR structure
* all files in a list called `pdb_list` in your working directory

### Read in the list and get all structures into the AMPAL framework

In [None]:
with open('pdb_list', 'r') as in_list:
    structures = [x.rstrip() for x in in_list]

In [None]:
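my_structures = []
for s in structures:
    try:
        m = isambard.ampal.convert_pdb_to_ampal(s)
        my_structures.append(m)
    except FileNotFoundError:
        # Report and skip any listed file that is missing, without hiding other errors
        print("Could not find {}, skipping".format(s))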
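
If you would rather check the file list before converting anything, one option is to separate the paths that exist from those that do not. This is a minimal sketch using only the standard library. The helper `read_existing_paths` is hypothetical, and the demonstration writes temporary files rather than touching the tutorial's `pdb_list`.

```python
import tempfile
from pathlib import Path


def read_existing_paths(list_file):
    """Return the paths named in `list_file` that exist, plus the missing ones."""
    found, missing = [], []
    for line in Path(list_file).read_text().splitlines():
        name = line.strip()
        if not name:
            continue  # ignore blank lines
        (found if Path(name).is_file() else missing).append(name)
    return found, missing


# Demonstration with temporary files instead of the tutorial's `pdb_list`.
with tempfile.TemporaryDirectory() as tmp:
    real_file = Path(tmp) / "example.pdb"
    real_file.write_text("END\n")
    list_file = Path(tmp) / "toy_list"
    list_file.write_text("{}\n\n{}\n".format(real_file, Path(tmp) / "absent.pdb"))
    found, missing = read_existing_paths(list_file)

print(len(found), "found,", len(missing), "missing")
```

Knowing up front which files are missing makes it clear whether you are analysing the full set of structures or only part of it.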

## AmpalContainer
`AmpalContainer` sits one level above an `Assembly` and holds multiple-model structures such as NMR ensembles. The code below finds which of the loaded structures is the multi-model NMR structure and keeps only its first model.
![AMPAL_Container](imgs/AMPAL_inheritance_incl_ampal_container.png)

In [None]:
my_ampal_structures = []
for m in my_structures:
    if isinstance(m,isambard.ampal.AmpalContainer):
        print("{} is the NMR structure".format(m.id))
        print("Taking 1st model only")
        first_structure = m[0]
        my_ampal_structures.append(first_structure)
    else:
        my_ampal_structures.append(m)

Now let's write some code to identify where the zinc ions are in each structure, and pull out their environment.

In [None]:
my_zn_envs = []
for structure in my_ampal_structures:
    print ("Examining {}".format(structure.id))
    ligs = structure.get_ligands()
    
    for n in ligs:
        if n.mol_code == "ZN":
            print ("{} ZN here".format(structure.id))
            zn_env = n.close_monomers(structure, cutoff=4.0)
            my_zn_envs.append(zn_env)

## Analysis
Can you use the ISAMBARD code you've learnt so far, together with a bit of Python, to analyse these zinc binding sites?

Hints: you could use a dictionary to keep track of amino acid counts, or you could keep a tally of distances in a list.

Which amino acid residues are typically closest to the zinc ions?

Sample code is given below if you need a starting point.

### Sequence analysis

In [None]:
my_amino_acid_count = {}
my_amino_acids = 'ACDEFGHIKLMNPQRSTVWY'

for x in list(my_amino_acids):
    my_amino_acid_count[x] = 0

for env in my_zn_envs:
    for residue in env:
        if type(residue) is isambard.ampal.Residue:
            my_amino_acid_count[residue.mol_letter] += 1

In [None]:
my_amino_acid_count
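
As an alternative to initialising a dictionary by hand, `collections.Counter` from the standard library builds the tally and sorts it for you. The sketch below runs on made-up one-letter codes rather than the real zinc environments, so the counts are illustrative only.

```python
from collections import Counter

# Toy environments: one-letter codes of residues near each of three zinc ions.
toy_envs = [["C", "C", "H", "H"], ["C", "C", "C", "C"], ["H", "D", "E", "F"]]

# Flatten the environments and count each letter.
counts = Counter(letter for env in toy_envs for letter in env)

# most_common() sorts by count, highest first.
top_two = counts.most_common(2)
print(top_two)
```

A `Counter` returns zero for letters it has never seen, so there is no need to pre-populate all twenty amino acids.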

### Can you work out the mean distance from each zinc atom to its binding residues? 
Hint: there is a function `isambard.tools.geometry.distance()` which takes two atom objects as arguments and returns the distance between them.

In [None]:
# One list of distances per amino acid type
my_distances = {}
for x in list(my_amino_acids):
    my_distances[x] = []

for env in my_zn_envs:
    # Find the zinc ligand in this environment
    my_zinc = None
    for residue in env:
        if residue.mol_code == "ZN":
            my_zinc = residue

    # Measure from the zinc to the C-alpha atom of each neighbouring residue
    for residue in env:
        if type(residue) is isambard.ampal.Residue:
            my_distance = isambard.tools.geometry.distance(my_zinc['ZN'],residue['CA'])
            my_distances[residue.mol_letter].append(my_distance)

### Find the average distance for the cysteine residues

In [None]:
import numpy as np
cys_array = np.array(my_distances['C'])
np.mean(cys_array)

In [None]:
np.std(cys_array)
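
The sample above measures from the zinc to each residue's C-alpha atom, so the numbers describe backbone proximity rather than the distance to the coordinating side-chain atom. For reference, the underlying arithmetic is just a Euclidean distance followed by a mean and standard deviation. The sketch below shows it with the standard library on made-up coordinates; it is not a substitute for `isambard.tools.geometry.distance()`.

```python
import math
import statistics

# Toy coordinates in angstroms (made up for illustration).
toy_zinc = (0.0, 0.0, 0.0)
toy_ca_atoms = [(3.0, 4.0, 0.0), (0.0, 6.0, 0.0), (0.0, 0.0, 7.0)]

# Straight-line distance from the zinc to each C-alpha position.
toy_distances = [math.dist(toy_zinc, ca) for ca in toy_ca_atoms]

mean_distance = statistics.mean(toy_distances)
# pstdev is the population standard deviation, which matches NumPy's default np.std.
spread = statistics.pstdev(toy_distances)
print(toy_distances, mean_distance, round(spread, 3))
```

If you want the sample standard deviation instead, use `statistics.stdev` here or pass `ddof=1` to `np.std`.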

## Phenylalanine?
One of the zinc binding sites has a phenylalanine residue close by. Find it, and see if you can work out what role the phenylalanine might be playing, if any.

In [None]:
for env in my_zn_envs:
    for residue in env:
        if residue.mol_code == "PHE":
            # Keep the whole environment so it can be viewed later
            my_phe_assembly = isambard.ampal.Polypeptide(env)
            print("PDB code is {}".format(residue.ampal_parent.ampal_parent.id))
            print("Chain ID is {}".format(residue.ampal_parent.id))
            print("Residue number is {}".format(residue.id))

            # Use a separate loop variable so the outer `residue` is not overwritten
            for monomer in env:
                if monomer.mol_code == "ZN":
                    print("Zinc is {} {}".format(monomer.id, monomer.ampal_parent.id))

In [None]:
view = nv.show_file("pdbs/4L7X.pdb")
view.add_representation('spacefill',selection="101:A",color='green')
view.add_ball_and_stick("{}:{}".format(" or ".join([str(x.id) for x in my_phe_assembly]),'A'))
view

## Summary
You should now be able to:

+ import PDB structures into the AMPAL framework
+ query structures by secondary structure and residue identity
+ tag AMPAL objects 
+ look at the environment around certain atoms
+ calculate distances between atoms