# Welcome to ProLint2

ProLint2 is a tool for the analysis of protein-lipid interactions. It has been completely rewritten and now includes many new features: 

1. Orders of magnitude faster than the original ProLint
2. Modular design for easy extension
3. Completely new visualization front end
4. Many other new features

### How to install the package

Please read the [README](https://github.com/ProLint/prolint2/blob/main/README.md) file for instructions on how to install ProLint2.

### How to get started

In [1]:
from prolint2.core import Universe



ProLint is now built on top of MDAnalysis. We provide a wrapper class around `MDAnalysis.Universe` that lets you load files and perform analysis the same way you would with MDAnalysis. The ProLint `Universe` object, however, has additional methods for analyzing protein-lipid interactions.

```python
from prolint2 import Universe
u = Universe('coordinates.gro', 'trajectory.xtc')
```

And that's it! You can now use the `u` object to perform analysis, exactly as you would with an MDAnalysis `Universe`. For example, to get the center of mass of the protein, you can do:

```python
u.select_atoms('protein').center_of_mass()
```

Of course, the reason to use ProLint is to analyze protein-lipid interactions. So let's see how we can do that. 

In [2]:
from prolint2.sampledata import GIRK
u = Universe(GIRK.coordinates, GIRK.trajectory)

In [3]:
n_frames = u.trajectory.n_frames
n_atoms = u.atoms.n_atoms

print(f'Number of frames: {n_frames}, number of atoms: {n_atoms}')

Number of frames: 13, number of atoms: 23820


### The `query` and `database` terminology

ProLint computes contacts between a reference group of atoms (usually the protein), which we call the `query`, and another group of atoms (usually the lipids), which we call the `database`. The query is the group of atoms or residues whose interactions you want to analyze. The database is the group of atoms they interact with. For example, to analyze the interactions between a protein and its surrounding lipids, the protein is the query and the lipids are the database.

When you create a `Universe` object, ProLint uses the proteins as the query and all other atoms as the database. You can access both groups through their attributes:

In [4]:
u.query, u.database

(<ProLint Wrapper for <AtomGroup with 2956 atoms>>,
 <ProLint Wrapper for <AtomGroup with 20864 atoms>>)

In [5]:
n_query_atoms = u.query.n_atoms
n_database_atoms = u.database.n_atoms

print(f'Number of query atoms: {n_query_atoms}, number of database atoms: {n_database_atoms}')

Number of query atoms: 2956, number of database atoms: 20864


Notice that the `query` and `database` attributes return ProLint wrapper objects around the underlying `MDAnalysis.AtomGroup` objects. This allows us to inspect and analyze the query and database groups directly:

In [6]:
u.query.resname_counts, u.database.resname_counts, u.database.get_resname(2345)

(Counter({'ARG': 64,
          'GLN': 40,
          'TYR': 40,
          'MET': 48,
          'GLU': 112,
          'LYS': 56,
          'THR': 96,
          'GLY': 76,
          'CYS': 32,
          'ASN': 52,
          'VAL': 104,
          'HIS': 28,
          'LEU': 120,
          'SER': 72,
          'ASP': 56,
          'PHE': 88,
          'TRP': 28,
          'ILE': 88,
          'ALA': 52,
          'PRO': 32}),
 Counter({'POPE': 652, 'POPS': 652, 'CHOL': 652}),
 'POPE')
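
The residue-name counts above print as `Counter` objects, so the usual `collections.Counter` methods should apply to them. The sketch below assumes exactly that; the numbers are copied by hand from the output above rather than read from a live `Universe`.

```python
from collections import Counter

# Counts copied from the output above (not computed from a trajectory here).
query_counts = Counter({
    'ARG': 64, 'GLN': 40, 'TYR': 40, 'MET': 48, 'GLU': 112,
    'LYS': 56, 'THR': 96, 'GLY': 76, 'CYS': 32, 'ASN': 52,
    'VAL': 104, 'HIS': 28, 'LEU': 120, 'SER': 72, 'ASP': 56,
    'PHE': 88, 'TRP': 28, 'ILE': 88, 'ALA': 52, 'PRO': 32,
})
database_counts = Counter({'POPE': 652, 'POPS': 652, 'CHOL': 652})

# Total number of residues in each group.
n_query_residues = sum(query_counts.values())
n_lipids = sum(database_counts.values())

# The three most frequent residue types in the query.
top3 = query_counts.most_common(3)

print(n_query_residues, n_lipids)
print(top3)
```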

### Computing contacts

Computing contacts between the query and database groups is simple and fast. Use the `compute_contacts` method of the `Universe` object, which takes the following arguments:
- `cutoff`: The cutoff distance to use for computing contacts, in Angstroms.
- `backend`: The backend to use for computing contacts. Currently this option has no effect and the default backend is always used. More backends will be added in the future.
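
To make the role of `cutoff` concrete, here is a minimal brute-force sketch in NumPy. It is not ProLint's implementation: `contacts_within_cutoff` is a hypothetical helper, the coordinates are made up, periodic boundaries are ignored, and treating a distance exactly equal to the cutoff as a contact is an assumption.

```python
import numpy as np

def contacts_within_cutoff(query_xyz, database_xyz, cutoff):
    """Return (query_index, database_index) pairs no farther apart than `cutoff`."""
    # Pairwise differences via broadcasting: shape (n_query, n_database, 3).
    diff = query_xyz[:, None, :] - database_xyz[None, :, :]
    # Euclidean distance for every query/database pair.
    dist = np.linalg.norm(diff, axis=-1)
    q_idx, d_idx = np.nonzero(dist <= cutoff)
    return list(zip(q_idx.tolist(), d_idx.tolist()))

# Made-up coordinates in Angstroms: 2 query atoms, 3 database atoms.
query_xyz = np.array([[0.0, 0.0, 0.0], [10.0, 0.0, 0.0]])
database_xyz = np.array([[3.0, 4.0, 0.0], [10.0, 0.0, 6.0], [30.0, 0.0, 0.0]])

pairs = contacts_within_cutoff(query_xyz, database_xyz, cutoff=7.0)
print(pairs)
```

A brute-force distance matrix like this scales with the product of the two group sizes, which is why production tools typically rely on neighbor-search structures instead.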

In [8]:
contacts = u.compute_contacts(cutoff=7) # cutoff in Angstroms

7


100%|██████████| 13/13 [00:00<00:00, 323.05it/s]


In [10]:
# This may take a few seconds because pandas is slow
df = contacts.create_dataframe(n_frames)
df 

| ResidueID | LipidId | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2482 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 2672 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 2681 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 2768 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 10 | 2648 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1259 | 2463 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1259 | 2755 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1259 | 2760 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1261 | 2463 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


Note that the DataFrame itself is very lightweight:

In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 5680 entries, (1, 2482) to (1263, 2468)
Data columns (total 13 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   0       5680 non-null   int8 
 1   1       5680 non-null   int8 
 2   2       5680 non-null   int8 
 3   3       5680 non-null   int8 
 4   4       5680 non-null   int8 
 5   5       5680 non-null   int8 
 6   6       5680 non-null   int8 
 7   7       5680 non-null   int8 
 8   8       5680 non-null   int8 
 9   9       5680 non-null   int8 
 10  10      5680 non-null   int8 
 11  11      5680 non-null   int8 
 12  12      5680 non-null   int8 
dtypes: int8(13)
memory usage: 135.5 KB
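
The `int8` dtype reported above likely accounts for much of this small footprint: each value takes 1 byte instead of the 8 bytes of pandas' default `int64`. The sketch below checks only the arithmetic for the values, using plain NumPy arrays of the same shape; the reported 135.5 KB is larger than the `int8` figure, presumably because it also counts the index.

```python
import numpy as np

# Shape reported by df.info() above: 5680 rows, 13 frame columns.
n_rows, n_frames = 5680, 13

as_int8 = np.zeros((n_rows, n_frames), dtype=np.int8)
as_int64 = np.zeros((n_rows, n_frames), dtype=np.int64)

# Bytes needed for the values alone (no index).
print(as_int8.nbytes, as_int64.nbytes)
```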


The output DataFrame has two index levels: 
1. Residue IDs of the query group
2. Lipid IDs of the database group

Each column corresponds to one frame of the trajectory. The values are the number of contacts between the query and database residues in that frame.
Only residue-lipid pairs that have at least one contact are included in the output DataFrame. 

Note that the above DataFrame provides a complete description of the contacts in your system at the given cutoff. If that is all you need, you can perform any analysis you want on this DataFrame directly, without using the rest of the ProLint API.
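
As an illustration of working on the DataFrame directly, the sketch below builds a tiny stand-in with the same layout (a `ResidueID`/`LipidId` MultiIndex and one column per frame) and derives two per-residue summaries. The values are made up, and the names `toy`, `occupancy`, and `n_lipids` are only for illustration.

```python
import pandas as pd

# Toy stand-in for the contacts DataFrame: 3 residue-lipid pairs, 4 frames.
index = pd.MultiIndex.from_tuples(
    [(1, 2482), (1, 2672), (10, 2648)], names=['ResidueID', 'LipidId']
)
toy = pd.DataFrame(
    [[0, 1, 1, 0],
     [0, 0, 1, 0],
     [1, 1, 1, 1]],
    index=index, dtype='int8',
)

# For each residue and frame: is the residue in contact with any lipid?
in_contact = toy.groupby(level='ResidueID').max()

# Fraction of frames in which each residue touches at least one lipid.
occupancy = in_contact.mean(axis=1)

# Number of distinct lipids each residue contacts over the trajectory
# (each row is a unique residue-lipid pair, so counting rows is enough).
n_lipids = toy.groupby(level='ResidueID').size()

print(occupancy)
print(n_lipids)
```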

In [12]:
import numpy as np
import pandas as pd

def get_lipids_by_residue_id(df: pd.DataFrame, residue_id: int) -> list:
    # Get all LipidIds that interact with the given ResidueID
    lipids = df.loc[residue_id].index.tolist()
    return lipids

def get_residues_by_lipid_id(df: pd.DataFrame, lipid_id: int) -> list:
    # Get all ResidueIDs that interact with the given LipidId
    residues = df.xs(lipid_id, level='LipidId', axis=0).index.tolist()
    return residues

def get_contact_data(df: pd.DataFrame, residue_id: int, lipid_id: int, output: str = 'contacts') -> np.ndarray:
    # Get the contact row for a residue-lipid pair as a NumPy array,
    # or (output='indices') the frame indices where that row is nonzero
    contact_array = df.loc[(residue_id, lipid_id)].to_numpy()

    if output == 'indices':
        return np.nonzero(contact_array)[0]
    else:
        return contact_array


In [22]:
lipid_ids = get_lipids_by_residue_id(df, 18) # all lipids that interact with residue id 18
residue_ids = get_residues_by_lipid_id(df, 2594) # all residues that interact with lipid id 2594
indices = get_contact_data(df, 18, 2594, output='indices') # indices of contacts between residue id 18 and lipid id 2594

Note that these functions are also available from the `contacts` instance created above, where they are faster than the DataFrame versions. The DataFrame is there so that you can perform any analysis you want, since, as mentioned above, it provides a complete description of your system. 
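
As one example of such an analysis, a per-frame contact array like the one returned by `get_contact_data` could be turned into contact durations, meaning the lengths of consecutive runs of frames in contact. This is a sketch only: `contact_durations` is a hypothetical helper, not a ProLint function, and the input array is made up.

```python
import numpy as np

def contact_durations(contact_array):
    """Lengths, in frames, of consecutive runs of nonzero entries."""
    in_contact = np.asarray(contact_array) != 0
    # Pad with False on both sides so every run has a clear start and end.
    padded = np.concatenate(([False], in_contact, [False]))
    changes = np.diff(padded.astype(np.int8))
    starts = np.nonzero(changes == 1)[0]   # transitions 0 -> 1
    ends = np.nonzero(changes == -1)[0]    # transitions 1 -> 0
    return (ends - starts).tolist()

# Made-up contact row covering 10 frames.
durations = contact_durations([0, 1, 1, 0, 1, 0, 0, 1, 1, 1])
print(durations)
```

Multiplying these frame counts by the trajectory's time step would convert them to physical time.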

In the next notebook, we will look at how you can modify the query and database to get more customized results.