In [51]:
%load_ext autoreload
%autoreload 2



# Getting Started

## Molecules, Graphs, and MoleculeSets

At the heart of PyProteoNet are types of molecules (like proteins and peptides) and connections between those molecules. Even though, for PyProteoNet, molecules
are just nodes in a graph and molecule types are just strings, we will focus on protein and peptide molecules as used
in many mass spectrometry (MS) experiments in the field of proteomics.
In addition, most functions of PyProteoNet are currently focused on protein and peptide aggregation and imputation.

Ultimately, most experiments aim to measure protein abundances. However, in most MS experiments proteins are digested into peptides
because only those smaller peptides are measured during an MS experiment. So we have two kinds of molecules (proteins and peptides) plus a mapping between
them, because every peptide can be mapped to at least one protein from which it can result during digestion.
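
To make the peptide-protein mapping concrete, the following sketch digests two made-up protein sequences with a simplified trypsin-like rule (cleave after every K or R, ignoring the proline exception and missed cleavages) and records which proteins each peptide can originate from. The sequences and all names in the snippet are illustrative assumptions and not part of PyProteoNet.

```python
import re
from collections import defaultdict

# Hypothetical toy sequences, chosen so that one peptide is shared.
toy_sequences = {"P1": "MAGKTLLRSEQK", "P2": "MSSKTLLRAAK"}

def digest(sequence):
    """Cleave after every K or R (simplified rule, no missed cleavages)."""
    return [pep for pep in re.split(r"(?<=[KR])", sequence) if pep]

# Map every peptide to the set of proteins it can result from.
peptide_to_proteins = defaultdict(set)
for protein_id, sequence in toy_sequences.items():
    for peptide in digest(sequence):
        peptide_to_proteins[peptide].add(protein_id)

for peptide, parents in sorted(peptide_to_proteins.items()):
    print(peptide, sorted(parents))
```

In this toy case one peptide is produced by both proteins, which is exactly the kind of shared peptide that a mapping has to represent.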

In PyProteoNet such a group of different types of molecules together with mappings between those molecule types is represented by a `MoleculeSet`.
So let's import it:

In [52]:
from pyproteonet.data import MoleculeSet

Next we need some data to create a MoleculeSet. For simplicity, we do not use real data but come up with a simple toy example
of 10 proteins and 100 peptides which we identify by integers.

In [53]:
import pandas as pd

proteins = pd.DataFrame(index=range(10))
peptides = pd.DataFrame(index=range(100))

As you can see proteins and peptides are represented by pandas dataframes. The index of the dataframes will be used as identifiers
for our molecules. With real-world data it might make sense to use the protein name or peptide sequence as index but here we just
use integers. Our dataframe could also have additional columns storing other molecule attributes, however, those are not required.
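
For illustration, a peptide table indexed by sequence with one extra attribute column could look like the following sketch. The sequences and the `length` column are made up for this example; only the general layout (identifiers in the index, optional attributes as columns) follows the description above.

```python
import pandas as pd

# Hypothetical peptide sequences used as identifiers (the DataFrame index).
sequences = ["MAGK", "TLLR", "SEQK"]
peptides_by_sequence = pd.DataFrame(
    {"length": [len(s) for s in sequences]},  # optional attribute column
    index=sequences,
)
print(peptides_by_sequence)
```

Whatever is used as index should be unique, since it serves as the molecule identifier.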

The only thing missing is a mapping between proteins and peptides.
Mappings are also created from pandas dataframes. To identify mapping partners, those dataframes must have a multiindex with every index level containing the molecule ids of the mapped molecules.

Here we just map every 10th peptide to the same protein.

In [54]:
peptide_protein_mapping = pd.DataFrame({'peptide':peptides.index, 'protein':peptides.index%10}).set_index(['peptide', 'protein'])

In [55]:
peptide_protein_mapping

peptide,protein
0,0
1,1
2,2
3,3
4,4
...,...
95,5
96,6
97,7
98,8
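
A quick sanity check of such a mapping can be done with plain pandas. The sketch below rebuilds the toy mapping and counts how many peptides map to each protein and how many proteins each peptide maps to. It relies only on pandas index operations and makes no assumptions about PyProteoNet internals.

```python
import pandas as pd

peptide_ids = pd.RangeIndex(100)
mapping = pd.DataFrame(
    {"peptide": peptide_ids, "protein": peptide_ids % 10}
).set_index(["peptide", "protein"])

# Number of peptides mapped to each protein.
peptides_per_protein = (
    mapping.index.get_level_values("protein").value_counts().sort_index()
)
# Number of proteins each peptide maps to; 1 means the peptide is unique.
proteins_per_peptide = mapping.index.get_level_values("peptide").value_counts()

print(peptides_per_protein.tolist())
print((proteins_per_peptide == 1).all())
```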


> **Side Note**: Internally mappings are wrapped by the `MoleculeMapping` class. This wrapper class can hold some additional information and facilitates e.g. mappings between the same molecule type. However, for most simple use cases like the protein-peptide use case this can be ignored.

From this data we can now generate a MoleculeSet. To support arbitrary molecule types and multiple mappings, molecules and
mappings need to be given as dictionaries.

In [56]:
ms = MoleculeSet(molecules = {'protein':proteins, 'peptide':peptides},
                 mappings = {'peptide-protein': peptide_protein_mapping}
                )

## Samples and Values

The MoleculeSet alone is not that helpful. Usually, we also want to attach (abundance) values to our molecules.
MS experiments usually comprise multiple samples, so the same quantity is measured several times. For this, PyProteoNet provides
`Dataset`s. A `Dataset` consists of a molecule graph and a variable number of `DatasetSample`s, each representing the values of one sample.
We can create a `Dataset` without any samples as follows:

In [57]:
from pyproteonet.data import Dataset
ds = Dataset(molecule_set=ms)

Next we add some samples to the dataset. Every sample is identified by a name and contains dataframes with values for our
different molecule types. E.g., let's assume we measured some abundance values for our peptides which we want to add.

In [58]:
import numpy as np
for i in range(3):
    sample_name = f'sample{i}'
    peptide_values = pd.DataFrame({'abundance': np.random.uniform(size=100) * 10000}, index=range(100))
    sample_molecule_values = {'peptide':peptide_values}
    ds.create_sample(name=sample_name, values=sample_molecule_values)

Again we give the sample values as a dictionary to assign them to the correct molecule type.

## Creating a Dataset Directly from Pandas Dataframes

Alternatively, we can create a dataset directly from abundance matrices (for proteins and peptides) given as pandas DataFrames, together with the mappings:

In [59]:
peptide_abundance = pd.DataFrame(np.random.uniform(size=(100, 10)) * 10000, index=range(100), columns=[f'sample{i}' for i in range(10)])
peptide_abundance

Unnamed: 0,sample0,sample1,sample2,sample3,sample4,sample5,sample6,sample7,sample8,sample9
0,7238.747,9477.294,2482.859,2813.936,4393.726,9029.044,1113.775,7041.178,6878.413,8024.203
1,321.908,7953.998,6135.773,7011.221,975.855,2266.122,4920.040,3304.337,694.311,230.674
2,9593.379,7722.989,3229.229,7529.776,3863.926,7613.917,2756.370,3592.661,4966.780,9382.149
3,7811.972,4188.765,6074.888,8542.109,3452.596,1447.429,6121.701,4387.109,9578.632,6511.629
4,351.829,6345.102,5339.551,9053.414,54.181,221.845,5506.478,5259.496,2272.418,8904.757
...,...,...,...,...,...,...,...,...,...,...
95,765.926,1498.311,6617.097,7533.247,9910.481,5687.050,7992.420,1737.802,1155.065,7174.543
96,9243.786,5982.204,4186.412,4564.316,7313.234,1794.702,7797.560,8233.446,8213.435,3201.958
97,4453.980,4693.264,1459.200,5572.061,4018.447,2316.927,2034.111,7180.904,7644.204,6497.266
98,2473.045,1036.253,283.278,5436.428,608.993,3627.040,762.640,7235.080,1142.041,347.414


In [60]:
ds = Dataset.from_pandas(dfs={'peptide':{'abundance':peptide_abundance}}, mappings={'peptide-protein': peptide_protein_mapping})

## Protein Aggregation

Since proteomics analyses often work with logarithmic abundance values, the peptide abundances of our artificial dataset should be logarithmized first.

This also helps to understand how to access and modify dataset values in PyProteoNet. One convenient way, shown here, is to access a single value field or column as a pandas DataFrame in long format, containing all abundance values with their sample and molecule id as multiindex. Logarithmization can then be done as shown below, saving the logarithmized values under a new column.

In [61]:
ds.values['peptide']['abundance_log'] = np.log(ds.values['peptide']['abundance'])

# as alternative PyProteoNet also provides a function to logarithmize whole datasets
from pyproteonet.processing import logarithmize
ds_log = logarithmize(ds)
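
The long format itself can be illustrated with plain pandas. The following sketch is independent of PyProteoNet and uses made-up values: it stacks a small id-by-sample matrix into a Series indexed by sample and id and then takes the natural logarithm, mirroring the `np.log` call above.

```python
import numpy as np
import pandas as pd

# Small made-up abundance matrix: rows are peptide ids, columns are samples.
wide = pd.DataFrame(
    {"sample0": [100.0, 200.0, 400.0], "sample1": [150.0, 50.0, 800.0]},
    index=pd.Index([0, 1, 2], name="id"),
)
wide.columns.name = "sample"

# Long format: one row per (sample, id) pair.
long_values = wide.stack().swaplevel().sort_index()
long_values.name = "abundance"

# Natural logarithm, as done with np.log above.
long_log = np.log(long_values).rename("abundance_log")
print(long_log)
```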

Usually we are interested in protein abundances. Therefore, the measured peptide abundance values need to be aggregated into protein abundance values.
This is done via the peptide-protein mapping using an aggregation function (also called quantification function).

Here we apply Top3 aggregation as a simple but commonly used aggregation function. This function computes the protein abundance as the average of the three most abundant peptides corresponding to a protein. In real-world datasets some peptides are usually shared between different proteins. Since their abundance values cannot be uniquely assigned to a protein, shared peptides are often ignored during abundance aggregation and only unique peptides are considered.

The result is represented as a pandas Series in long format with a multiindex to identify samples and protein ids.

In [62]:
from pyproteonet.aggregation.partner_summarization import partner_top_n_mean
top3 = partner_top_n_mean(dataset=ds, molecule='protein', mapping='peptide-protein', partner_column='abundance_log',
                          top_n=3, only_unique=True)

In [63]:
top3

sample   id
sample0  0    8.973
         1    8.965
         2    9.147
         3    9.063
         4    8.800
               ... 
sample9  5    8.874
         6    8.961
         7    9.136
         8    9.041
         9    9.055
Name: quanti, Length: 100, dtype: float64
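
To make the Top3 rule concrete, here is a minimal plain-pandas sketch on made-up data. It is not the implementation behind `partner_top_n_mean`; it only illustrates the idea of dropping shared peptides and averaging the three largest remaining values per protein and sample. All table and column names in the snippet are chosen for this example.

```python
import pandas as pd

# Made-up log abundances in long format for a single sample.
values = pd.DataFrame({
    "sample": ["s0"] * 6,
    "peptide": [0, 1, 2, 3, 4, 5],
    "abundance_log": [9.0, 8.0, 7.0, 6.0, 5.0, 4.0],
})
# Mapping in which peptide 4 is shared between protein 0 and protein 1.
mapping = pd.DataFrame({
    "peptide": [0, 1, 2, 3, 4, 4, 5],
    "protein": [0, 0, 0, 0, 0, 1, 1],
})

# Keep only peptides that map to exactly one protein.
n_proteins = mapping.groupby("peptide")["protein"].transform("nunique")
unique_mapping = mapping[n_proteins == 1]

# Attach protein ids to the values and average the top 3 peptides per protein.
merged = values.merge(unique_mapping, on="peptide")
top3_sketch = (
    merged.groupby(["sample", "protein"])["abundance_log"]
    .apply(lambda s: s.nlargest(3).mean())
)
print(top3_sketch)
```

A protein with fewer than three unique peptides is simply averaged over what is available in this sketch; how the library treats that case is not covered here.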

To assign this to our dataset the following syntax can be used:

In [64]:
ds.values['protein']['top3'] = top3

Alternatively, most functions also allow the direct specification of a result column. So an alternative formulation of the Top3 aggregation could be as follows:

In [65]:
_ = partner_top_n_mean(dataset=ds, molecule='protein', mapping='peptide-protein', partner_column='abundance_log',
                       top_n=3, only_unique=True, result_column='top3')

The long format of the Top3-aggregated protein abundance values is not very intuitive, and a matrix representation is often used instead. Therefore, the `Dataset` class provides functions to represent data in different formats. To get the Top3 results as a pandas DataFrame with samples as columns and proteins as rows, the `get_samples_value_matrix` function can be useful:

In [66]:
ds.get_samples_value_matrix(molecule='protein', column='top3')

Unnamed: 0_level_0,sample0,sample1,sample2,sample3,sample4,sample5,sample6,sample7,sample8,sample9
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
0,8.973,9.051,8.868,9.093,9.108,9.031,9.02,8.917,8.922,9.038
1,8.965,8.975,9.11,9.09,9.069,9.056,8.983,8.702,9.082,9.05
2,9.147,8.979,8.948,9.096,9.115,9.049,8.966,8.962,9.068,9.124
3,9.063,8.585,9.034,8.866,8.872,9.092,9.087,9.099,9.138,8.963
4,8.8,9.119,9.061,9.122,8.678,9.083,8.854,9.02,9.081,8.954
5,8.937,9.034,8.908,9.078,9.03,8.701,9.137,9.002,8.971,8.874
6,9.023,8.764,9.097,8.998,8.884,9.063,9.087,9.068,8.967,8.961
7,9.057,9.023,9.053,8.993,8.863,8.994,9.146,9.021,9.083,9.136
8,8.969,8.928,9.107,9.113,9.099,9.03,9.129,9.017,8.582,9.041
9,8.82,9.059,9.174,8.972,8.991,9.035,8.928,8.965,8.845,9.055
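
Conceptually, this matrix view corresponds to pivoting the long format so that the sample level becomes the columns. The sketch below shows that relationship with plain pandas and made-up numbers; it does not claim to reproduce how `get_samples_value_matrix` works internally.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [["sample0", "sample1"], [0, 1, 2]], names=["sample", "id"]
)
long_values = pd.Series([8.9, 9.0, 9.1, 8.8, 9.2, 9.0], index=index, name="top3")

# Move the sample level into the columns: rows are ids, columns are samples.
matrix = long_values.unstack(level="sample")
print(matrix)
```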


In addition to the rather simple neighbor average (Top3), PyProteoNet also provides an efficient implementation of the more complex MaxLFQ protein aggregation method proposed by [Cox et al.](https://www.mcponline.org/article/S1535-9476(20)33310-7/fulltext).

This method takes peptide abundance ratios between all samples into account and then solves a least squares optimization problem to find the protein abundance values that best represent the observed peptide abundances.
Similar to [Cox et al.](https://www.mcponline.org/article/S1535-9476(20)33310-7/fulltext), we require at least two non-missing peptide abundances here. Additionally, we need to specify that the given values are already logarithmic to allow the correct calculation of peptide ratios between samples.

In [67]:
from pyproteonet.aggregation import maxlfq
_ = maxlfq(dataset=ds, molecule='protein', mapping='peptide-protein', partner_column='abundance_log',
           min_ratios=2, median_fallback=False, result_column='maxlfq', is_log=True)
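
The core idea can be sketched in a few lines of NumPy for a single protein. This is a simplified illustration under stated assumptions, not the PyProteoNet implementation: the function name `maxlfq_sketch` is hypothetical, the final rescaling step of the original method is replaced by anchoring the mean of the profile to the mean of all observed values, and samples without enough shared peptides are not treated specially.

```python
import itertools
import numpy as np

def maxlfq_sketch(log_x, min_ratios=2):
    """log_x: peptides x samples array of log abundances (NaN = missing)."""
    n_samples = log_x.shape[1]
    rows, rhs = [], []
    for i, j in itertools.combinations(range(n_samples), 2):
        shared = ~np.isnan(log_x[:, i]) & ~np.isnan(log_x[:, j])
        if shared.sum() < min_ratios:
            continue  # too few shared peptides for this sample pair
        row = np.zeros(n_samples)
        row[i], row[j] = 1.0, -1.0
        rows.append(row)
        # Median log ratio between samples i and j over the shared peptides.
        rhs.append(np.median(log_x[shared, i] - log_x[shared, j]))
    # Anchor (assumption of this sketch): fix the mean of the profile.
    rows.append(np.full(n_samples, 1.0 / n_samples))
    rhs.append(np.nanmean(log_x))
    solution, *_ = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)
    return solution

# Made-up protein profile, observed through three peptides with own offsets.
profile = np.array([8.0, 8.5, 9.0, 7.5])
offsets = np.array([0.0, -1.0, 0.5])
log_x = offsets[:, None] + profile[None, :]
log_x[0, 1] = np.nan  # one missing peptide measurement
estimate = maxlfq_sketch(log_x)
print(estimate - estimate[0])
```

Because the pairwise ratios only determine differences between samples, the printed values are expressed relative to the first sample.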

Next to the matrix representation we can also look at all columns of a molecule using its long format representation as pandas DataFrame with multiindex. To do so we can again use the `values` attribute of our dataset. Using the `df` shortcut, we get a DataFrame of all columns for a molecule type in long format.

In [68]:
ds.values['protein'].df

Unnamed: 0_level_0,Unnamed: 1_level_0,top3,maxlfq
sample,id,Unnamed: 2_level_1,Unnamed: 3_level_1
sample0,0,8.973,8.369
sample0,1,8.965,7.666
sample0,2,9.147,7.573
sample0,3,9.063,8.418
sample0,4,8.800,8.036
...,...,...,...
sample9,5,8.874,8.315
sample9,6,8.961,7.841
sample9,7,9.136,8.543
sample9,8,9.041,8.223


## Missing Value Imputation

PyProteoNet provides a wide range of established as well as newly proposed graph neural network (GNN) based missing value imputation functions.

The interface of most imputation functions is similar to that of the aggregation functions shown above. In addition to a dataset, you need to provide the molecule type as well as the column(s) to impute and, optionally, values for method-specific hyperparameters.

Of course, for imputation, we first of all need some missing values. So for our example we just mask some of the Top3 values. To do so we, again, use the `values` attribute of our dataset to get all Top3 values as pandas DataFrame in long format. Then we replace some of the values with `NaN` using pandas and, finally, write the result back as a new column in our dataset.

In [69]:
vals = ds.values['protein']['top3']
vals.loc[vals.sample(frac=0.33).index] = np.nan
ds.values['protein']['top3_masked'] = vals
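
The masking step can be tried out in isolation with plain pandas. The sketch below uses a made-up Series and works on an explicit copy so that the unmasked values remain available as ground truth; the fixed `random_state` is only there to make the example reproducible.

```python
import numpy as np
import pandas as pd

values = pd.Series(np.linspace(8.5, 9.5, 12), name="top3")

# Work on a copy so the original values stay available as ground truth.
masked = values.copy()
to_mask = masked.sample(frac=0.25, random_state=0).index
masked.loc[to_mask] = np.nan

print(masked.isna().sum(), "of", len(masked), "values masked")
```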

In [70]:
ds.values['protein'].df

Unnamed: 0_level_0,Unnamed: 1_level_0,top3,maxlfq,top3_masked
sample,id,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
sample0,0,8.973,8.369,8.973
sample0,1,8.965,7.666,
sample0,2,9.147,7.573,9.147
sample0,3,9.063,8.418,
sample0,4,8.800,8.036,
...,...,...,...,...
sample9,5,8.874,8.315,8.874
sample9,6,8.961,7.841,8.961
sample9,7,9.136,8.543,
sample9,8,9.041,8.223,9.041


Let's use the commonly used MissForest and BPCA imputation methods to impute the missing values.
PyProteoNet provides a high-level API for common imputation functions that operate on a single molecule type only (e.g. on protein level).

In [71]:
from pyproteonet.imputation.high_level_api import impute_molecule

impute_molecule(dataset=ds, molecule='protein', column='top3_masked', methods=['missforest', 'bpca'],
                result_columns=['top3_missforest', 'top3_bpca'])

  0%|          | 0/2 [00:00<?, ?it/s]

Imputing with methods missforest, storig results in value column top3_missforest
Iteration: 0
Iteration: 1
Iteration: 2
Iteration: 3
Iteration: 4
Imputing with methods bpca, storig results in value column top3_bpca


Alternatively, PyProteoNet provides separate functions for all imputation methods, which allows, for example, specifying additional arguments:

In [72]:
from pyproteonet.imputation.r.miss_forest import miss_forest_impute

ds.values['protein']['top3_missforest'] = miss_forest_impute(dataset=ds, molecule='protein', column='top3_masked', ntree=5)

Looking at the result we can see that the missing values are gone:

In [73]:
ds.values['protein'].df

Unnamed: 0_level_0,Unnamed: 1_level_0,top3,maxlfq,top3_masked,top3_missforest,top3_bpca
sample,id,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
sample0,0,8.973,8.369,8.973,8.973,8.973
sample0,1,8.965,7.666,,8.992,9.034
sample0,2,9.147,7.573,9.147,9.147,9.147
sample0,3,9.063,8.418,,9.046,9.034
sample0,4,8.800,8.036,,9.073,9.034
...,...,...,...,...,...,...
sample9,5,8.874,8.315,8.874,8.874,8.874
sample9,6,8.961,7.841,8.961,8.961,8.961
sample9,7,9.136,8.543,,9.050,9.008
sample9,8,9.041,8.223,9.041,9.041,9.041
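
Since the unmasked Top3 values are still available, imputed values can be compared against them at the masked positions. The sketch below does this with plain pandas for the six fully visible rows involving masked or neighboring entries in the table above, using the mean absolute error as one possible metric. It is only an illustration: with random toy data the numbers carry no meaning, and the choice of metric is an assumption of this example.

```python
import numpy as np
import pandas as pd

# Rows copied from the output table above (rounded to three decimals).
df = pd.DataFrame(
    {
        "top3": [8.973, 8.965, 9.147, 9.063, 8.800, 9.136],
        "top3_masked": [8.973, np.nan, 9.147, np.nan, np.nan, np.nan],
        "top3_missforest": [8.973, 8.992, 9.147, 9.046, 9.073, 9.050],
        "top3_bpca": [8.973, 9.034, 9.147, 9.034, 9.034, 9.008],
    }
)

# Evaluate only positions that were masked before imputation.
was_masked = df["top3_masked"].isna()
errors = {}
for column in ["top3_missforest", "top3_bpca"]:
    diff = df.loc[was_masked, column] - df.loc[was_masked, "top3"]
    errors[column] = diff.abs().mean()
    print(column, round(errors[column], 3))
```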


If you look at the import of the `miss_forest_impute` function you will notice that this function is part of the "r" subpackage.
Most established imputation algorithms are implemented in the R programming language and provided as R packages. To also provide those algorithms while maintaining a unified Python interface, PyProteoNet wraps those R packages.
Therefore, all algorithms in the "r" subpackage require an existing R installation (if you use a conda/mamba environment you could simply install one with the command `conda install -c conda-forge r-base`). For the user it is transparent whether an imputation function is implemented in Python or wrapped from an R package. The installation of R dependencies and the conversion of data types between Python and R are done in the background by PyProteoNet (internally the rpy2 package is used for this).
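
A simple way to check whether an R installation is visible from the current environment is to look for the `Rscript` executable on the `PATH`. This is only a rough heuristic sketched with the standard library; rpy2 may locate R by other means (for example via the `R_HOME` environment variable), so a negative result here is not conclusive.

```python
import shutil

# Look for the Rscript executable on the PATH; None means it was not found.
rscript_path = shutil.which("Rscript")
r_available = rscript_path is not None
print("Rscript:", rscript_path if r_available else "not found on PATH")
```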

> **Note**: Compared to other proteomics- or imputation-focused packages, PyProteoNet allows you to jointly manage protein and peptide values as well as the relations between them (plus any additional molecule types if required). Together with the implemented aggregation and imputation functions, this allows for more versatile usage scenarios. E.g. imputation can be applied both on peptide level (before aggregation) and on protein level (after aggregation). In addition, applying and comparing different imputation algorithms and strategies on the same dataset is facilitated.

## Graph Neural Network Imputation

While imputation is traditionally applied either on peptide OR on protein level, modelling proteins and peptides as a graph structure allows for flexible imputation strategies that jointly take information from both molecule types into account. To this end, imputation is formulated as a regression problem on the protein-peptide graph, which is then solved by training a graph neural network (GNN).
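
To give an intuition for the graph view, the following NumPy sketch builds the bipartite protein-peptide adjacency matrix for a tiny made-up mapping and performs a single neighborhood aggregation, in which every protein receives the mean of its observed peptide values. This is not the GNN used by PyProteoNet; it only shows the kind of neighbor aggregation that GNN layers build on, here without any learned weights.

```python
import numpy as np

# Toy mapping: peptide p belongs to protein p % 2 (6 peptides, 2 proteins).
peptide_ids = np.arange(6)
protein_ids = peptide_ids % 2

# Bipartite adjacency matrix: rows are proteins, columns are peptides.
adjacency = np.zeros((2, 6))
adjacency[protein_ids, peptide_ids] = 1.0

# Peptide values with one missing measurement.
peptide_values = np.array([1.0, np.nan, 3.0, 4.0, 5.0, 6.0])
observed = ~np.isnan(peptide_values)

# Each protein receives the mean of its observed neighboring peptides.
weights = adjacency * observed
protein_messages = weights @ np.nan_to_num(peptide_values) / weights.sum(axis=1)
print(protein_messages)
```

A trained GNN replaces the plain mean with learned, value-dependent transformations and can pass information in both directions, from peptides to proteins and back.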

While PyProteoNet provides different flavors and implementations using different network architectures and training schemes for the underlying GNN, those imputation methods can be called via a similar interface as other imputation methods.
Since two types of molecules (proteins and peptides) are taken into account, the names of those molecule types as well as two value columns have to be specified.

Additional hyperparameters can be set as well (here, a small number of epochs and a low early-stopping patience are used because this is a toy example):

In [74]:
from pyproteonet.imputation.dnn.gnn import impute_heterogeneous_gnn

# For the toy example with random data we use a small number of epochs and patience
impute_heterogeneous_gnn(dataset=ds, molecule='protein', column='top3_masked', mapping='peptide-protein', partner_column='abundance_log',
                         molecule_result_column='gnn_hetero', partner_result_column='gnn_hetero',
                         max_epochs=3, early_stopping_patience=3)

seed: 881061604


GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name              | Type            | Params
------------------------------------------------------
0 | embedding         | Embedding       | 50    
1 | molecule_fc_model | Sequential      | 11.0 K
2 | partner_fc_model  | Sequential      | 11.4 K
3 | molecule_gat      | HeteroGraphConv | 34.4 K
4 | partner_gat       | HeteroGraphConv | 50.4 K
5 | molecule_gat2     | HeteroGraphConv | 66.4 K
6 | molecule_linear   | Linear          | 820   
7 | partner_linear    | Linear          | 1.2 K 
8 | loss_fn           | GaussianNLLLoss | 0     
------------------------------------------------------
175 K     Trainable params
0         Non-trainable params
175 K     Total params
0.703     Total estimated model params size (MB)


Training: |                                                                                                   …

step29: num_masked_molecule:67.000 || num_masked_partner:239.167 || molecule_loss:0.466 || partner_loss:0.512 || train_loss:0.978 || epoch:0.000 || 
step59: num_masked_molecule:67.000 || num_masked_partner:235.867 || molecule_loss:0.385 || partner_loss:0.470 || train_loss:0.855 || epoch:1.000 || 


`Trainer.fit` stopped: `max_epochs=3` reached.
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


step89: num_masked_molecule:67.000 || num_masked_partner:249.233 || molecule_loss:0.298 || partner_loss:0.452 || train_loss:0.749 || epoch:2.000 || 


Predicting: |                                                                                                 …

sample   id
sample0  0    8.973
sample1  0    9.051
sample2  0    9.041
sample3  0    9.093
sample4  0    9.108
               ... 
sample5  9    9.035
sample6  9    8.928
sample7  9    8.958
sample8  9    8.994
sample9  9    9.055
Length: 100, dtype: float64

In [75]:
ds.values['protein'].df

Unnamed: 0_level_0,Unnamed: 1_level_0,top3,maxlfq,top3_masked,top3_missforest,top3_bpca,gnn_hetero
sample,id,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
sample0,0,8.973,8.369,8.973,8.973,8.973,8.973
sample0,1,8.965,7.666,,8.992,9.034,9.029
sample0,2,9.147,7.573,9.147,9.147,9.147,9.147
sample0,3,9.063,8.418,,9.046,9.034,9.029
sample0,4,8.800,8.036,,9.073,9.034,9.030
...,...,...,...,...,...,...,...
sample9,5,8.874,8.315,8.874,8.874,8.874,8.874
sample9,6,8.961,7.841,8.961,8.961,8.961,8.961
sample9,7,9.136,8.543,,9.050,9.008,9.007
sample9,8,9.041,8.223,9.041,9.041,9.041,9.041
