# Molecules, Graphs, and MoleculeSets

At the heart of PyProteoNet are types of molecules (like proteins and peptides) and connections between those molecules. Even though, for PyProteoNet, molecules
are just nodes in a graph and molecule types are just strings, we will focus on protein and peptide molecules as measured
in many mass spectrometry (MS) experiments in the field of proteomics.
In addition, most functions of PyProteoNet are currently focused on protein and peptide aggregation and imputation.

Ultimately, most experiments aim to measure protein abundances. However, in most MS experiments proteins are digested into peptides
because only those smaller peptides are measured by the instrument. So we have two kinds of molecules (proteins and peptides) plus a mapping between
them, because every peptide can be mapped to at least one protein from which it can result during digestion.

In PyProteoNet such a group of different types of molecules together with mappings between those molecule types is represented by a `MoleculeSet`.
So let's import it:

In [1]:
from pyproteonet.data import MoleculeSet

Next we need some data to create a `MoleculeSet`. For simplicity, we do not use real data but come up with a simple toy example
of 10 proteins and 100 peptides, which we identify by integers.

In [2]:
import pandas as pd

proteins = pd.DataFrame(index=range(10))
peptides = pd.DataFrame(index=range(100))

As you can see, proteins and peptides are represented by pandas dataframes. The index of each dataframe is used as the identifier
of our molecules. With real-world data it might make sense to use the protein name or peptide sequence as index, but here we just
use integers. The dataframes could also have additional columns storing other molecule attributes; however, those are not required.

The only thing missing is a mapping between proteins and peptides.
Mappings are also created from pandas dataframes. To identify mapping partners, those dataframes must have a multiindex in which every index level contains the molecule ids of one of the mapped molecule types.

Here peptide `i` is mapped to protein `i % 10`, so every 10th peptide maps to the same protein.

In [3]:
peptide_protein_mapping = pd.DataFrame({'peptide':peptides.index, 'protein':peptides.index%10}).set_index(['peptide', 'protein'])

In [4]:
peptide_protein_mapping

peptide,protein
0,0
1,1
2,2
3,3
4,4
...,...
95,5
96,6
97,7
98,8
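
In real data a peptide may map to several proteins, which simply shows up as a repeated peptide id in the first index level. The following sketch uses only pandas and a made-up three-peptide mapping (the names `toy_mapping`, `proteins_per_peptide`, and `unique_peptides` are hypothetical and not part of PyProteoNet). It builds such a multiindex and counts how many proteins each peptide maps to.

```python
import pandas as pd

# Hypothetical toy mapping: peptide 2 is shared between proteins 0 and 1.
toy_mapping = pd.DataFrame(
    index=pd.MultiIndex.from_tuples(
        [(0, 0), (1, 1), (2, 0), (2, 1)], names=["peptide", "protein"]
    )
)

# Turn the two index levels into ordinary columns so they can be grouped.
pairs = toy_mapping.index.to_frame(index=False)

# Number of distinct proteins per peptide (1 means the peptide is unique).
proteins_per_peptide = pairs.groupby("peptide")["protein"].nunique()
unique_peptides = proteins_per_peptide[proteins_per_peptide == 1].index.tolist()
print(unique_peptides)
```

In the toy mapping of this tutorial every peptide maps to exactly one protein, so all peptides would count as unique.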


> **Side Note**: Internally mappings are wrapped by the `MoleculeMapping` class. This wrapping class can hold some additional information and facilitates e.g. mappings between the same molecule type. However, for most simple use cases like the protein-peptide use case this can be ignored.

From this data we can now generate a `MoleculeSet`. To support arbitrary molecule types and multiple mappings, molecules and
mappings need to be given as dictionaries.

In [5]:
ms = MoleculeSet(molecules = {'protein':proteins, 'peptide':peptides},
                 mappings = {'peptide-protein': peptide_protein_mapping}
                )

# Samples and Values

The `MoleculeSet` alone is not that helpful. Usually, we also want to attach (abundance) values to our molecules.
In MS experiments we usually even have multiple samples, so the same quantity is measured several times. For this purpose PyProteoNet provides
the `Dataset` class. A `Dataset` consists of a molecule graph (the `MoleculeSet`) and a variable number of `DatasetSample`s, each holding the values of one sample.
We can create a `Dataset` without any samples as follows:

In [6]:
from pyproteonet.data import Dataset
ds = Dataset(molecule_set=ms)

Next we add some samples to the dataset. Every sample is identified by a name and contains dataframes with values for our
different molecule types. For example, let's assume we measured some abundance values for our peptides, which we want to add.

In [7]:
import numpy as np
for i in range(3):
    sample_name = f'sample{i}'
    peptide_values = pd.DataFrame({'abundance': np.random.uniform(size=100) * 10000}, index=range(100))
    sample_molecule_values = {'peptide':peptide_values}
    ds.create_sample(name=sample_name, values=sample_molecule_values)

Again we give the sample values as a dictionary to assign them to the correct molecule type.
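
Because the abundance values above are drawn without a fixed seed, the numbers change from run to run. Below is a minimal sketch of a reproducible variant, assuming NumPy's `default_rng` generator with an arbitrarily chosen seed of 0. The helper name `make_peptide_values` is hypothetical and not part of PyProteoNet.

```python
import numpy as np
import pandas as pd


def make_peptide_values(rng, n_peptides=100):
    """Draw uniform toy abundances in [0, 10000) for peptides 0..n_peptides-1."""
    abundance = rng.uniform(size=n_peptides) * 10000
    return pd.DataFrame({"abundance": abundance}, index=range(n_peptides))


# One generator, seeded once, yields the same three samples on every run.
rng = np.random.default_rng(0)
samples = {f"sample{i}": {"peptide": make_peptide_values(rng)} for i in range(3)}
```

Each entry of `samples` has the same structure as the `values` dictionary passed to `ds.create_sample` in the loop above.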

# Protein Aggregation

Usually we are interested in protein abundances. Therefore, the measured peptide abundance values need to be aggregated into protein abundance values.
This is done via the peptide-protein mapping using an aggregation (or quantification) function.

Here we apply Top3 aggregation as a simple but commonly used aggregation function. This function computes the abundance of a protein as the average of the three most abundant peptides mapped to that protein. In real-world datasets some peptides are usually shared between different proteins. Since their abundance values cannot be uniquely assigned to a protein, shared peptides are often ignored during abundance aggregation and only unique peptides are considered.

The result is represented as a pandas Series in long format with a multiindex to identify samples and protein ids.

In [8]:
from pyproteonet.quantification.neighbor_summarization import neighbor_top_n_mean
top3 = neighbor_top_n_mean(dataset=ds, molecule='protein', mapping='peptide-protein', partner_column='abundance',
                           top_n=3, only_unique=True)

In [9]:
top3

sample   id
sample0  0     9094.317599
         1     9041.109171
         2     8110.967893
         3     8888.045253
         4     6414.815315
         5     7594.530335
         6     8569.054925
         7     7194.720128
         8     5343.510530
         9     7224.719882
sample1  0     8639.711048
         1     8131.990755
         2     9187.682710
         3     9176.357375
         4     9367.159603
         5     5965.732080
         6     9126.903228
         7     9627.654676
         8     8825.260362
         9     8471.054487
sample2  0     7155.455770
         1     7139.038511
         2     9301.488098
         3     9053.599399
         4     6923.564320
         5     7170.520557
         6     8188.727798
         7     9009.645698
         8     8774.985146
         9     8444.135622
Name: quanti, dtype: float64
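
To make the Top3 rule concrete, here is a plain pandas sketch for a single sample. It is not PyProteoNet's implementation; the data and the names `pairs`, `abundance`, and `top3_sketch` are invented for illustration. Shared peptides are dropped first (the idea assumed to be behind `only_unique=True`), then the three largest abundances per protein are averaged.

```python
import pandas as pd

# Hypothetical peptide-protein pairs; peptide 4 is shared by proteins 0 and 1.
pairs = pd.DataFrame({
    "peptide": [0, 1, 2, 3, 4, 4, 5],
    "protein": [0, 0, 0, 0, 0, 1, 1],
})
# Hypothetical peptide abundances of one sample, indexed by peptide id.
abundance = pd.Series([10.0, 40.0, 30.0, 20.0, 99.0, 6.0], index=range(6))

# Keep only peptides that map to exactly one protein.
proteins_per_peptide = pairs.groupby("peptide")["protein"].nunique()
unique_pairs = pairs[pairs["peptide"].map(proteins_per_peptide) == 1].copy()
unique_pairs["abundance"] = unique_pairs["peptide"].map(abundance)

# Average the (up to) three most abundant unique peptides of every protein.
top3_sketch = unique_pairs.groupby("protein")["abundance"].apply(
    lambda values: values.nlargest(3).mean()
)
print(top3_sketch)
```

Protein 0 keeps the unique peptides 0 to 3 and averages the three largest values (40, 30, 20), while protein 1 is left with the single unique peptide 5. The shared peptide 4 does not contribute, despite having the highest abundance.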

To assign the `top3` Series computed above to our dataset, the following syntax can be used:

In [10]:
ds.values['protein']['top3'] = top3

Alternatively, most functions also allow the direct specification of a result column. So an alternative formulation of the Top3 aggregation could be as follows:

In [11]:
_ = neighbor_top_n_mean(dataset=ds, molecule='protein', mapping='peptide-protein', partner_column='abundance',
                        top_n=3, only_unique=True, result_column='top3')

The long format of the Top3 aggregated protein abundance values is not very intuitive, and a matrix representation is often used instead. Therefore, the `Dataset` class provides functions to represent data in different formats. To get the Top3 results as a pandas DataFrame with samples as columns and proteins as rows, the `get_samples_value_matrix` function can be useful:

In [12]:
ds.get_samples_value_matrix(molecule='protein', column='top3')

id,sample0,sample1,sample2
0,9094.317599,8639.711048,7155.45577
1,9041.109171,8131.990755,7139.038511
2,8110.967893,9187.68271,9301.488098
3,8888.045253,9176.357375,9053.599399
4,6414.815315,9367.159603,6923.56432
5,7594.530335,5965.73208,7170.520557
6,8569.054925,9126.903228,8188.727798
7,7194.720128,9627.654676,9009.645698
8,5343.51053,8825.260362,8774.985146
9,7224.719882,8471.054487,8444.135622
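
The relation between the long format and the matrix format can be illustrated with plain pandas. The sketch below uses a small made-up Series with the same `(sample, id)` multiindex layout as `top3` and pivots it with `Series.unstack`. It only illustrates the reshaping and makes no claim about how `get_samples_value_matrix` works internally.

```python
import pandas as pd

# Made-up long-format values with a (sample, id) multiindex.
index = pd.MultiIndex.from_product(
    [["sample0", "sample1"], [0, 1, 2]], names=["sample", "id"]
)
long_values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], index=index, name="quanti")

# Move the 'sample' level into the columns: rows are ids, columns are samples.
matrix = long_values.unstack(level="sample")
print(matrix)
```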


In addition to the rather simple neighbor average (Top3), PyProteoNet also provides an efficient implementation of the more complex MaxLFQ protein aggregation method proposed by [Cox et al.](https://www.mcponline.org/article/S1535-9476(20)33310-7/fulltext).

In [13]:
from pyproteonet.quantification.maxlfq import maxlfq
_ = maxlfq(dataset=ds, molecule='protein', mapping='peptide-protein', partner_column='abundance',
           min_ratios=2, median_fallback=False, result_column='maxlfq')
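
The core idea of MaxLFQ, as described by Cox et al., is to derive protein ratios between pairs of samples from the median of the peptide ratios and then to find the abundance profile that best satisfies all pairwise ratios in a least-squares sense. The following NumPy sketch illustrates this for a single protein under simplifying assumptions: it skips the final rescaling to absolute intensities, does not model the `median_fallback` option, and the function name `maxlfq_sketch` as well as the toy data are made up. It is not the implementation used by PyProteoNet.

```python
import itertools

import numpy as np


def maxlfq_sketch(intensities, min_ratios=2):
    """Relative log-abundances of one protein across samples.

    intensities: array of shape (n_peptides, n_samples), NaN marks missing values.
    """
    log_int = np.log(intensities)
    n_samples = log_int.shape[1]
    rows, ratios = [], []
    for j, k in itertools.combinations(range(n_samples), 2):
        diff = log_int[:, j] - log_int[:, k]
        diff = diff[~np.isnan(diff)]  # peptides measured in both samples
        if diff.size < min_ratios:
            continue  # too few shared peptides for this sample pair
        row = np.zeros(n_samples)
        row[j], row[k] = 1.0, -1.0
        rows.append(row)
        ratios.append(np.median(diff))
    if not rows:
        return np.full(n_samples, np.nan)
    # Least-squares fit of x_j - x_k to the median log-ratios. lstsq returns
    # the minimum-norm solution, which fixes the otherwise free overall offset.
    x, *_ = np.linalg.lstsq(np.array(rows), np.array(ratios), rcond=None)
    return x


# Toy protein: three peptides, three samples with true ratios 1 : 2 : 4.
toy = np.outer([100.0, 200.0, 400.0], [1.0, 2.0, 4.0])
toy[0, 2] = np.nan  # one missing measurement
x = maxlfq_sketch(toy)
print(np.exp(x - x[0]))  # abundance of each sample relative to the first one
```

Because only ratios enter the fit, the result is defined up to a common factor, which is why the output is reported relative to the first sample.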

