## `prolintpy`: loading data and topology description

`prolintpy` relies on `MDTraj` to read input data files, so `MDTraj` is the only module that has to be imported alongside `prolintpy`:

In [1]:
import mdtraj as md
import prolintpy as pl

Load the data using MDTraj

In [2]:
t = md.load('./data/test_data_1.xtc', top='./data/test_data_1.gro')

In [3]:
t

<mdtraj.Trajectory with 17 frames, 23820 atoms, 3240 residues, and unitcells at 0x25aad61e400>

## Load the data to prolintpy and define the protein and lipid topology

We first specify the resolution of the input data and indicate whether we want to combine the proteins (only applicable if there is more than one protein in the system). 
<br>Combining proteins means that the calculated metrics are averaged over all copies. 

In [4]:
resolution = "martini"
combine_proteins = False
lipids = pl.Lipids(t.topology, resolution=resolution)
proteins = pl.Proteins(t.topology, resolution=resolution).system_proteins(merge=combine_proteins)
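
To make the effect of combining concrete, the sketch below averages a per-residue metric over several copies of one protein. It is a standalone illustration that uses only the standard library. The contact values and the helper name `average_over_copies` are made up for this example and are not meant to reflect how `prolintpy` computes its metrics internally.

```python
from statistics import mean

def average_over_copies(per_copy_values):
    """Average a per-residue metric over all copies of a protein.

    per_copy_values: list of equally long lists, one list per copy.
    (Hypothetical helper, for illustration only.)
    """
    n_residues = len(per_copy_values[0])
    if any(len(values) != n_residues for values in per_copy_values):
        raise ValueError("all copies must have the same number of residues")
    # zip(*...) groups the values of residue i from every copy together
    return [mean(residue_values) for residue_values in zip(*per_copy_values)]

# Made-up metric values for 3 residues in 2 copies of the same protein
copies = [
    [1.0, 0.0, 4.0],
    [3.0, 2.0, 0.0],
]
combined = average_over_copies(copies)
print(combined)
```

With `combine_proteins = False`, the analogous situation is keeping the two inner lists separate instead of reducing them to one.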

## Extract information about the input system

Get the names of all the lipid types in the system

In [5]:
lipids.lipid_names()

array(['POPE', 'POPS', 'CHOL'], dtype=object)

Get the names of the different lipids along with their counts

In [6]:
lipids.lipid_count()

{'POPE': 652, 'POPS': 652, 'CHOL': 652}
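
As a rough idea of what such a count involves, the following sketch tallies lipid molecules per species from per-bead records, counting each residue ID only once. The records and the helper name `count_lipids` are invented for this example; this is not `prolintpy`'s implementation.

```python
from collections import Counter

# Made-up (resSeq, resName) pairs, one entry per bead
atoms = [
    (1, "POPE"), (1, "POPE"), (2, "POPE"),
    (3, "CHOL"), (3, "CHOL"),
    (4, "POPS"),
]

def count_lipids(atom_records):
    """Count lipid molecules per species, counting each residue once."""
    unique_residues = set(atom_records)  # collapse the beads of one residue
    return dict(Counter(res_name for _, res_name in unique_residues))

lipid_counts = count_lipids(atoms)
print(sorted(lipid_counts.items()))
```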

Get a pandas DataFrame describing the lipids in the system

In [7]:
lipids.ldf.head()

      serial name element  resSeq resName  chainID segmentID
2956    2957  NH3       N    1285    POPE        0          
2957    2958  PO4       P    1285    POPE        0          
2958    2959  GL1      VS    1285    POPE        0          
2959    2960  GL2      VS    1285    POPE        0          
2960    2961  C1A       C    1285    POPE        0          


Retrieve the residue IDs of all cholesterol lipids

In [8]:
lipids.ldf[lipids.ldf.resName == "CHOL"].resSeq.unique()

array([1937, 1938, 1939, 1940, 1941, 1942, 1943, 1944, 1945, 1946, 1947,
       1948, 1949, 1950, 1951, 1952, 1953, 1954, 1955, 1956, 1957, 1958,
       1959, 1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969,
       1970, 1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980,
       1981, 1982, 1983, 1984, 1985, 1986, 1987, 1988, 1989, 1990, 1991,
       1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002,
       2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013,
       2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022, 2023, 2024,
       2025, 2026, 2027, 2028, 2029, 2030, 2031, 2032, 2033, 2034, 2035,
       2036, 2037, 2038, 2039, 2040, 2041, 2042, 2043, 2044, 2045, 2046,
       2047, 2048, 2049, 2050, 2051, 2052, 2053, 2054, 2055, 2056, 2057,
       2058, 2059, 2060, 2061, 2062, 2063, 2064, 2065, 2066, 2067, 2068,
       2069, 2070, 2071, 2072, 2073, 2074, 2075, 2076, 2077, 2078, 2079,
       2080, 2081, 2082, 2083, 2084, 2085, 2086, 20

List the proteins found in the system and store the first one (the only one here) in a variable. 
`prolintpy` derives topology information for proteins from the input coordinate file. 
<br>Two proteins will be considered the same if they are entirely identical (same number of residues, completely identical order and type of atoms/beads). 

In [9]:
proteins

[<prolintpy.Protein containing 1 replicate(s) of Protein0 and 1284 beads each>]
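
The identity rule described above (same residues, same order and type of beads) can be illustrated with a small standalone sketch. The chain data and the helper name `group_identical` are hypothetical, and the code only demonstrates the idea of grouping by an exact sequence match; it does not reproduce how `prolintpy` detects protein copies.

```python
from collections import defaultdict

def group_identical(chains):
    """Group chains whose (residue name, bead name) sequences match exactly."""
    groups = defaultdict(list)
    for chain_id, beads in chains.items():
        # The full bead sequence acts as the identity key of a protein
        groups[tuple(beads)].append(chain_id)
    return list(groups.values())

# Made-up chains: A and B are identical, C lacks one bead
chains = {
    "A": [("ARG", "BB"), ("ARG", "SC1"), ("GLN", "BB")],
    "B": [("ARG", "BB"), ("ARG", "SC1"), ("GLN", "BB")],
    "C": [("ARG", "BB"), ("GLN", "BB")],
}
groups = group_identical(chains)
print(groups)
```

In this toy example, chains A and B would count as two replicates of one protein, while C would be a separate protein type.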

In [10]:
protein = proteins[0]

Get various pieces of information about the protein.

In [11]:
protein.name = "GIRK" # Give the protein a better name

In [12]:
protein.n_residues

1284

In [13]:
print(protein.first_residue, protein.last_residue)

1 1284


In [14]:
protein.counter

1

Get the indices for residues 50, 60, and 70

In [15]:
protein.get_indices([50, 60, 70])

Using the available dataframe


[array([124, 125], dtype=int64),
 array([155, 156, 157, 158, 159], dtype=int64),
 array([179, 180], dtype=int64)]
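
The mapping from residue numbers to bead indices can be sketched in a few lines of plain Python. The toy topology and the helper name `indices_by_residue` are assumptions made for this example; the real `get_indices` method works on the protein's DataFrame and may differ in its details.

```python
from collections import defaultdict

def indices_by_residue(res_seq_per_atom, residues):
    """Return, for each requested residue ID, the list of its atom indices.

    res_seq_per_atom: residue ID of every atom, in topology order.
    (Hypothetical helper; not prolintpy's implementation.)
    """
    lookup = defaultdict(list)
    for atom_index, res_seq in enumerate(res_seq_per_atom):
        lookup[res_seq].append(atom_index)
    # Residues that are not present map to an empty list
    return [lookup.get(res, []) for res in residues]

# Toy topology: residue 1 has 3 beads, residue 2 has 2, residue 3 has 1
res_seq_per_atom = [1, 1, 1, 2, 2, 3]
selected = indices_by_residue(res_seq_per_atom, [1, 3])
print(selected)
```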

### Why `prolintpy` is easy to scale up

If the input system contains a single copy of a single protein type (as in this example), then `proteins` is a list with only one element. This means a little extra work to get the protein out of the list, but it provides much more flexibility when handling more complex systems. You can use the `counter` attribute together with the length of the `proteins` list to extract information about the proteins in the system dynamically.

For instance, we can get a DataFrame representation of each protein in the system dynamically, that is, without knowing anything about the composition of the input system.<br>
One way of doing that is the following:

In [16]:
def get_dataframes(proteins):
    """
    Takes as input a list of prolintpy.Protein objects and returns a list
    with one DataFrame for each copy of each protein in the system.
    """
    dataframe_list = [
        protein.dataframe[protein_copy]
        for protein in proteins                     # each protein type
        for protein_copy in range(protein.counter)  # each copy of that type
    ]
    return dataframe_list

In [17]:
# returns a list of DataFrame elements 
get_dataframes(proteins)

[      serial name element  resSeq resName  chainID segmentID
 0          1   BB       B       1     ARG        0          
 1          2  SC1       S       1     ARG        0          
 2          3  SC2       S       1     ARG        0          
 3          4   BB       B       2     GLN        0          
 4          5  SC1       S       2     GLN        0          
 5          6   BB       B       3     ARG        0          
 6          7  SC1       S       3     ARG        0          
 7          8  SC2       S       3     ARG        0          
 8          9   BB       B       4     TYR        0          
 9         10  SC1       S       4     TYR        0          
 10        11  SC2       S       4     TYR        0          
 11        12  SC3       S       4     TYR        0          
 12        13   BB       B       5     MET        0          
 13        14  SC1       S       5     MET        0          
 14        15   BB       B       6     GLU        0          
 15     
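
To see how `get_dataframes()` behaves when there are several protein types and copies, without needing a trajectory, the function can be run on simple stand-in objects. The sketch below assumes only what the function itself relies on: each protein has a `counter` attribute and a `dataframe` attribute that can be indexed by copy number. The `SimpleNamespace` objects and the strings that take the place of DataFrames are placeholders, not real `prolintpy` objects.

```python
from types import SimpleNamespace

def get_dataframes(proteins):
    """Return one entry per copy of each protein, in order."""
    return [
        protein.dataframe[copy_index]
        for protein in proteins
        for copy_index in range(protein.counter)
    ]

# Stand-ins: strings take the place of real DataFrames
protein_a = SimpleNamespace(counter=2, dataframe=["A copy 0", "A copy 1"])
protein_b = SimpleNamespace(counter=1, dataframe=["B copy 0"])

frames = get_dataframes([protein_a, protein_b])
print(frames)
```

The result is a flat list ordered by protein type first and copy number second.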

# Exercise

Test the above commands using a system that contains multiple proteins in different numbers of copies/replicates. In particular, test the function `get_dataframes()` and see how it works for such systems. <br>
The test files `test_data_2.xtc` and `test_data_2.gro` contain a system that has four copies/replicates of one protein type. Note how the very simple function that we wrote gives us complete access to the protein topology of the system. 

In [18]:
t = md.load('./data/test_data_2.xtc', top='./data/test_data_2.gro')

In [19]:
t

<mdtraj.Trajectory with 26 frames, 59412 atoms, 6026 residues, and unitcells at 0x25ab4e02668>