# Random Input 10-minute Challenge

Julien gave me a 10-minute challenge. What could I learn from a random molecular input file using [sire](https://sire.openbiosim.org) in just 10 minutes? And so here I am, with the URLs for two files, `SYSTEM.top` and `SYSTEM.crd`.

Step one was to get access to [sire](https://sire.openbiosim.org). That's pretty easy now that we've made it available through our Jupyter notebook service at https://try.openbiosim.org. I went to this service, logged in using my GitHub username, and then started a blank Jupyter notebook. In the top cell I imported [sire](https://sire.openbiosim.org), following the instructions we've put in the [quickstart guide](https://sire.openbiosim.org/quickstart/).

In [None]:
import sire as sr

I'd been told that the files were in a GitHub repo, and that I could download them via the URL `https://github.com/OpenBioSim/posts/raw/main/sire/001_ten_minute`. This was worth putting into a variable...

In [None]:
url = "https://github.com/OpenBioSim/posts/raw/main/sire/001_ten_minute"

Rather than downloading the files myself, I used the new [sire.load](https://sire.openbiosim.org/tutorial/part01/02_loading_a_molecule.html) function to download and load the files directly from a URL. I used the [sire.expand](https://sire.openbiosim.org/tutorial/part01/05_loading_from_multiple_files.html#loading-from-multiple-files) function to specify multiple files at once, prefixing their names with the above `url`.
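
Conceptually, the expansion step just builds one full path per file from a shared prefix. Here is a minimal pure-Python sketch of that idea. The helper `expand_paths` is a hypothetical name of my own, not part of sire, and `sr.expand` may well do more than this.

```python
def expand_paths(base, *filenames):
    """Prefix each filename with `base`, joined by exactly one '/'.

    Hypothetical helper for illustration only; not sire's implementation.
    """
    return [f"{base.rstrip('/')}/{name}" for name in filenames]


url = "https://github.com/OpenBioSim/posts/raw/main/sire/001_ten_minute"

# One full URL per input file
for path in expand_paths(url, "SYSTEM.top", "SYSTEM.crd"):
    print(path)
```

Stripping a trailing slash from the prefix means the sketch behaves the same whether or not the base URL ends in `/`.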

In [None]:
mols = sr.load(sr.expand(url, "SYSTEM.top", "SYSTEM.crd"))

This has returned the loaded molecules, which I've put into `mols`. Whenever I load something, I like to just print it to the screen to quickly see what I have got.

In [None]:
mols

Aha - so we can already see that this file contained 18,575 molecules, across 18,914 residues and 60,695 atoms. That's interesting, but we can do so much more!

We can use [search functionality](https://sire.openbiosim.org/cheatsheet/search.html) to search for different molecules (and parts of molecules). A useful search term is `water`. This returns all of the water molecules that have been loaded.

In [None]:
mols["water"]

Not bad - there are quite a few! 18,459 of the 18,575 molecules are waters. So what about the rest? Another useful search term is `protein`...

In [None]:
mols["protein"]

Ok - three protein molecules, each of which has 1724 atoms over 114 residues. I suspect this may be a homotrimer? We can check by looking at the amino acid sequence of each protein.




In [None]:
proteins = mols["protein"]

seq1 = proteins[0].residues().names() 

print("Protein 2 has the same sequence as 1?", seq1 == proteins[1].residues().names())
print("Protein 3 has the same sequence as 1?", seq1 == proteins[2].residues().names())

Not a homotrimer? How are they different?

In [None]:
seq3 = proteins[2].residues().names()

for res1, res3 in zip(proteins[0].residues(), proteins[2].residues()):
    if res1.name() != res3.name():
        print(f"{res1.name().value()}:{res1.number().value()} is different to "
              f"{res3.name().value()}:{res3.number().value()}\n")

print(":".join([x.value() for x in seq1]))
print(":".join([x.value() for x in seq3]))

Ok - it is just a difference in titration state for the histidine residues (HID:63 in the first protein versus HIE:291 in the third). Yes, I think we can call this a homotrimer.
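
If this kind of check came up often, the comparison could be made tolerant of histidine naming. The sketch below works on plain strings and treats the Amber names HID, HIE and HIP as the same residue as HIS. The helper names and the three-residue fragments are invented for illustration. Residue names from sire would first need converting to strings with `.value()`, as in the loop above.

```python
# Amber residue names that all refer to histidine
HIS_VARIANTS = {"HIS", "HID", "HIE", "HIP"}


def canonical(resname):
    """Map any histidine variant onto the single name 'HIS'."""
    return "HIS" if resname in HIS_VARIANTS else resname


def same_sequence(seq_a, seq_b):
    """True if two sequences match once histidine variants are merged."""
    return len(seq_a) == len(seq_b) and all(
        canonical(a) == canonical(b) for a, b in zip(seq_a, seq_b)
    )


# Invented fragments, not taken from SYSTEM.top
chain_a = ["ALA", "HID", "GLY"]
chain_c = ["ALA", "HIE", "GLY"]

print("Strict comparison:  ", chain_a == chain_c)
print("Tolerant comparison:", same_sequence(chain_a, chain_c))
```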

What else can we find? 

Unfortunately, there isn't an easy way to define a "ligand". So, instead, let's look for everything that is neither protein nor water...

In [None]:
mols["not (protein or water)"]

It looks like we have 113 molecules. The first is called "LIG", so it is probably the ligand. The others look like sodium and chloride ions. Just to be sure that there is only one ligand, let's look for everything that has more than one atom, and is also not protein or water...

In [None]:
ligand = mols["count(atoms) > 1 and not (protein or water)"]
ligand

Cool - we have a single ligand. But what does it look like? Let's use the [view function](https://sire.openbiosim.org/quickstart/index.html#quick-start-guide) and take a look.

In [None]:
ligand.view()

This is a nice 3D view of the ligand, built using [sire's](https://sire.openbiosim.org) integration with [nglview](https://nglviewer.org/#nglview).

We can do more than just look at the molecule. The input files are in Amber format, so they include molecular mechanics parameters for the molecules. This means we can use [sire's](https://sire.openbiosim.org) built-in molecular mechanics engine to [calculate the energy](https://sire.openbiosim.org/tutorial/part04/03_energies.html).

In [None]:
ligand.energy()

This has returned the total energy of the ligand. But this energy is made up of components, such as the bond, angle and dihedral terms. We can access those too!

In [None]:
ligand.energy().components()

It's interesting, looking at these energies, that the molecule appears to be in a high-energy conformation. The total energy is positive (over 37 kcal mol⁻¹), driven by high angle, dihedral and 1-4 non-bonded energies. Maybe being bound to the protein has forced the ligand into an unfavourable conformation?

To find out, let's look to see what is around the ligand. It would be really convenient to have a `ligand` search term... Fortunately, we have the power to create our own search terms via [sire.search.set_token](https://sire.openbiosim.org/cheatsheet/search.html#creating-custom-search-tokens).

In [None]:
sr.search.set_token("ligand", "count(atoms) > 1 and not (protein or water)")

This has created the token `ligand`, meaning we can now use this directly to search for ligand molecules...

In [None]:
ligand = mols["ligand"]
ligand

Let's now find all the protein residues that are within 3 Å of the ligand...

In [None]:
residues = mols["(residues within 3 of ligand) and protein"]
residues

There are 12 residues. These appear to be from only two of the protein molecules. We can confirm by asking for the molecules that contain these residues...

In [None]:
residues.molecules()

Yep - just two of the protein chains have residues that are within 3 Å of the ligand (at least for this conformation). Let's take a look at these residues...

In [None]:
residues.view()

They form a nice little pocket which I think the ligand would fit nicely into...

In [None]:
mols["((residues within 3 of ligand) and protein) or ligand"].view()

Yes, that does look pretty snug. But how snug?

Let's loop over all the residues and calculate their energy with the ligand. We'll put this into a python dictionary, in a format that will make it easy to analyse via [pandas](https://pandas.pydata.org) later...

In [None]:
data = {"residue": [], "component": [], "energy": []}

for residue in residues:
    # get the name and number of each residue as an ID
    resid = f"{residue.name().value()}:{residue.number().value()}"
    
    # calculate the energy between this residue and the ligand
    energy = residue.energy(ligand)
    
    # now save the components of this residue into the dictionary above...
    for component in energy.components():
        data["residue"].append(resid)
        data["component"].append(component)
        data["energy"].append(energy[component].to(sr.units.kcal_per_mol))

    # also save the total energy into the dictionary
    data["residue"].append(resid)
    data["component"].append("total")
    data["energy"].append(energy.to(sr.units.kcal_per_mol))

I chose the dictionary format above as it makes it really easy to import this data into a [pandas DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.from_dict.html).

In [None]:
import pandas as pd
df = pd.DataFrame.from_dict(data)
df
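
Before plotting, a quick text ranking can be a handy sanity check. The sketch below takes a dictionary in the same long format as `data` and sorts the `total` rows from most to least favourable, using only plain Python. The residue labels and numbers are invented placeholders, not results from this system, and `totals_by_residue` is a hypothetical helper name.

```python
# Placeholder table in the same long format as `data`.
# All labels and energies here are invented for illustration.
example = {
    "residue": ["AAA:1", "AAA:1", "AAA:1", "BBB:2", "BBB:2", "BBB:2"],
    "component": ["coulomb", "LJ", "total", "coulomb", "LJ", "total"],
    "energy": [-5.0, -1.0, -6.0, 0.5, -2.0, -1.5],
}


def totals_by_residue(table):
    """Return (residue, total energy) pairs, most negative first."""
    rows = zip(table["residue"], table["component"], table["energy"])
    totals = {res: value for res, comp, value in rows if comp == "total"}
    return sorted(totals.items(), key=lambda item: item[1])


for resid, value in totals_by_residue(example):
    print(f"{resid:>8s} {value:8.2f}")
```

This assumes the energies have already been converted to plain numbers, as the `.to(sr.units.kcal_per_mol)` calls above do.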

Now that this is in a DataFrame, we can use the built-in plotting tools to create a bar chart of the total energies...

In [None]:
df[ df["component"] == "total" ].plot.bar(x="residue")

It is clear that the interaction between the ligand and LYS:33 is the strongest and most favourable. This interaction will have both Coulomb and Lennard-Jones components...

In [None]:
df[ df["component"] == "LJ" ].plot.bar(x="residue")

The Lennard-Jones components are relatively small, but all (except for ASN:326) are pretty favourable...

In [None]:
df[ df["component"] == "coulomb" ].plot.bar(x="residue")

The Coulomb components are dominated by LYS:33 and ASN:326. Indeed, most of the favourable binding energy appears to come from the Coulomb interaction between LYS:33 and the ligand. Let's take a look to see if we can understand why.

In [None]:
mols["ligand or (resname LYS and resnum 33)"].view()

Maybe there is a specific atom-atom interaction that is responsible? To check, we can loop over all pairs of atoms between the ligand and this residue to find the closest pair...

In [None]:
# Start by setting the closest value to a large distance...
closest = (1000 * sr.units.angstrom, None, None)

# loop over all atoms in LYS:33
for atom0 in mols["resname LYS and resnum 33"].atoms():
    # and then loop over all atoms in the ligand
    for atom1 in ligand:
        # calculate their distance using the sr.measure function...
        dist = sr.measure(atom0, atom1)
        
        # if the distance is less than `closest`, then save this
        # distance and the pair of atoms
        if dist < closest[0]:
            closest = (dist, atom0, atom1)

The above code uses the [sire.measure](https://sire.openbiosim.org/tutorial/part04/01_measure.html#making-measurements-between-atoms) function. This can be used to measure lengths, angles and torsions between pretty much anything in [sire](https://sire.openbiosim.org). In this case, it calculated the distance between each pair of atoms. The closest pair were saved into the variable `closest`.
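
The same brute-force search can be written more compactly once coordinates are available as plain tuples. This is only a sketch of the technique: the atom names and coordinates below are invented, `closest_pair` is a hypothetical helper, and `math.dist` (Python 3.8+) stands in for `sr.measure`.

```python
import itertools
import math


def closest_pair(coords_a, coords_b):
    """Return (distance, name_a, name_b) for the closest pair of points.

    Each argument maps a label to an (x, y, z) tuple.
    """
    return min(
        (math.dist(xyz_a, xyz_b), name_a, name_b)
        for (name_a, xyz_a), (name_b, xyz_b) in itertools.product(
            coords_a.items(), coords_b.items()
        )
    )


# Invented coordinates in Å, for illustration only
group_a = {"A1": (0.0, 0.0, 0.0), "A2": (1.0, 0.0, 0.0)}
group_b = {"B1": (4.0, 0.0, 0.0), "B2": (1.0, 2.0, 0.0)}

print(closest_pair(group_a, group_b))
```

Like the loop above, this checks every pair, so the cost grows with the product of the two atom counts. That is fine for one residue against one ligand.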

In [None]:
print(closest)

The two closest atoms were HZ1 of LYS:33 and NAM of the ligand. We can calculate their interaction energies using the [energy](https://sire.openbiosim.org/tutorial/part04/03_energies.html#getting-energy-components) function again...

In [None]:
atom0 = closest[1]
atom1 = closest[2]

print(atom0.energy(atom1))
print(atom0.energy(atom1).components())

This energy is a significant chunk of the residue-ligand energy.
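
To get a feel for why one close atom pair can contribute so much, here is a rough point-charge Coulomb estimate. It is a sketch only: the constant is the usual approximate conversion factor for charges in units of e and distances in Å, the partial charges are placeholders I made up (the real ones live in `SYSTEM.top`), and sire's own calculation may differ in detail, for example in how it treats cutoffs.

```python
# Approximate Coulomb conversion factor, in kcal Å mol⁻¹ e⁻²
COULOMB_CONSTANT = 332.06


def coulomb_energy(q0, q1, distance):
    """Point-charge Coulomb energy in kcal/mol.

    Charges are in units of e and the distance is in Å.
    """
    return COULOMB_CONSTANT * q0 * q1 / distance


# Placeholder charges of opposite sign, not the real force field values
for r in (2.4, 3.0, 4.0):
    print(f"r = {r:.1f} Å  E = {coulomb_energy(0.5, -0.5, r):.1f} kcal/mol")
```

Because the energy scales as 1/r, opposite charges at short range give a large favourable term that falls off only slowly with distance.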

With this, my 10 minutes are up. So what did I find?

1. This is a model of a ligand bound to a large trimeric protein (a homotrimer, but one where the titration state of a histidine is different between the first and third proteins). This is solvated in a bath of water molecules and counter ions.
2. The ligand is bound in a slightly unfavourable conformation.
3. The closest residues are from two of the proteins, which form a pocket within which the ligand has bound.
4. The ligand is bound most tightly to LYS:33, through a strong electrostatic interaction.
5. This interaction is mostly between the HZ1 atom of lysine and NAM atom of the ligand. These two atoms are separated by 2.4 Å.

Given another 10 minutes, and perhaps even a dynamics trajectory, I wonder what else I can find? But, for now, Julien, how did I do?