# Molecules, Graphs, and MoleculeSets

At the heart of PyProteoNet are molecules and the connections between these molecules. Even though for PyProteoNet molecules
are just nodes in a graph and molecule types are just strings, we will focus on protein and peptide molecules as used
in many mass spectrometry (MS) experiments in the field of proteomics.
In addition, most functions of PyProteoNet are currently focused on protein and peptide quantification and imputation.

Ultimately, most experiments aim to measure protein abundances. However, in most MS experiments proteins are digested into peptides
because only those smaller peptides can be measured. So we have two kinds of molecules (proteins and peptides) plus a mapping between
them. Every peptide can be mapped to at least one protein from which it can result during digestion.

In PyProteoNet such a group of different types of molecules together with mappings between those molecule types is represented by a `MoleculeSet`.
So let's import it:

In [2]:
from pyproteonet.data import MoleculeSet

Next we need some data to create a `MoleculeSet`. For simplicity, we do not use real data but come up with a simple toy example
of 10 proteins and 100 peptides, which we identify by integers.

In [1]:
import pandas as pd

proteins = pd.DataFrame(index=range(10))
peptides = pd.DataFrame(index=range(100))

As you can see, proteins and peptides are represented by pandas dataframes. The index of each dataframe is used as the identifier
for our molecules. With real-world data it might make sense to use the protein name or peptide sequence as index, but here we just
use integers. The dataframes could also have additional columns storing other molecule attributes, but those are not required.
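
As a minimal sketch of what such molecule dataframes might look like with real-world-style identifiers, the following uses plain pandas only. The protein names, peptide sequences, and the `gene` and `length` columns are made up for illustration and are not something PyProteoNet requires.

```python
import pandas as pd

# Hypothetical protein table: name-like strings as index plus one optional
# attribute column (illustrative only, not required by the tutorial).
named_proteins = pd.DataFrame(
    {"gene": ["GENE_A", "GENE_B"]},
    index=pd.Index(["PROT_A", "PROT_B"], name="protein"),
)

# Hypothetical peptide table: the amino acid sequence serves as identifier.
sequences = ["AAGLK", "LVNELTEFAK", "YLYEIAR"]
named_peptides = pd.DataFrame(
    {"length": [len(s) for s in sequences]},
    index=pd.Index(sequences, name="peptide"),
)

print(named_proteins)
print(named_peptides)
```

Because the index carries the identifier, a single molecule can be looked up directly, for example with `named_peptides.loc["AAGLK"]`.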

The only thing missing is a mapping between proteins and peptides.
A mapping is represented by assigning a mapping identifier to every molecule. Such an indirect mapping allows, for example,
relating proteins and peptides via a third entity like a gene. In this case the gene id would be the mapping identifier.
However, in our case we just want to use the protein-peptide relation directly. Therefore, we map both proteins and peptides to protein ids.
For proteins this simply means using the identity function as the mapping.

In [14]:
protein_protein_mapping = pd.DataFrame({'id':proteins.index, 'map_id':proteins.index}) #identity mapping
peptide_protein_mapping = pd.DataFrame({'id':peptides.index, 'map_id':peptides.index%10})

As with the molecules, mappings are also represented by pandas dataframes. A mapping dataframe requires an "id" column referring to
the index of a molecule type and a "map_id" column assigning each molecule a mapping identifier.
Here peptide `i` is mapped to protein `i % 10`, so every tenth peptide maps to the same protein.
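
To illustrate how a shared mapping identifier relates two molecule types, here is a small pandas-only sketch. It rebuilds the toy mapping dataframes from above and joins them on `map_id`. The variable names are chosen for this example, and the join only illustrates the idea; it is not necessarily how PyProteoNet resolves mappings internally.

```python
import pandas as pd

# Same toy mappings as above: both molecule types are mapped to protein ids.
protein_map = pd.DataFrame({"id": range(10), "map_id": range(10)})
peptide_map = pd.DataFrame({"id": range(100), "map_id": [i % 10 for i in range(100)]})

# Molecules sharing a map_id are related; an inner join on map_id lists the pairs.
pairs = peptide_map.merge(protein_map, on="map_id", suffixes=("_peptide", "_protein"))

# Peptides 3, 13, 23, ... all relate to protein 3.
peptides_of_protein_3 = pairs.loc[pairs["id_protein"] == 3, "id_peptide"].tolist()
print(peptides_of_protein_3)

# Number of peptides per protein.
print(pairs.groupby("id_protein").size())
```

With a gene id as mapping identifier, the same join would relate every peptide to all proteins of the same gene.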

From this data we can now generate a `MoleculeSet`. To support arbitrary molecule types and multiple mappings, molecules and
mappings need to be given as dictionaries.

In [16]:
ms = MoleculeSet(molecules = {'protein':proteins, 'peptide':peptides},
                 mappings = {'protein_mapping': {'protein':protein_protein_mapping, 'peptide':peptide_protein_mapping}}
                )

# Samples and Values

The `MoleculeSet` alone is not that helpful. Usually, we also want to attach (measured) values to our molecules.
In MS experiments we usually even have multiple samples, so the same quantity is measured several times. For this PyProteoNet provides
`Dataset`s. A `Dataset` consists of a `MoleculeSet` (the molecule graph) and a variable number of `DatasetSample`s, each holding the values of one sample.
We can create a `Dataset` without any samples as follows:

In [19]:
from pyproteonet.data import Dataset
ds = Dataset(molecule_set=ms)

Next we add some samples to the dataset. Every sample is identified by a name and contains dataframes with values for our
different molecule types. For example, let's assume we measured abundance values for our peptides, which we now want to add.

In [22]:
import numpy as np
for i in range(3):
    sample_name = f'sample{i}'
    peptide_values = pd.DataFrame({'abundance': np.random.uniform(size=100) * 10000}, index=range(100))
    sample_molecule_values = {'peptide':peptide_values}
    ds.create_sample(name=sample_name, values=sample_molecule_values)

Again we give the sample values as a dictionary to assign them to the correct molecule type.
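
The random values above differ on every run. As a hedged sketch, the per-sample value dictionaries could also be built with a seeded NumPy generator so that the toy data is repeatable. The `samples` dictionary and the `overview` table are helper names introduced for this example. Each inner dictionary has the same shape as the one passed to `ds.create_sample` above, but the call itself is left out to keep the snippet self-contained.

```python
import numpy as np
import pandas as pd

# A seeded generator makes the toy abundances repeatable (the seed is arbitrary).
rng = np.random.default_rng(seed=0)

samples = {}
for i in range(3):
    peptide_values = pd.DataFrame(
        {"abundance": rng.uniform(size=100) * 10000}, index=range(100)
    )
    # One dictionary per sample, keyed by molecule type.
    samples[f"sample{i}"] = {"peptide": peptide_values}

# Side-by-side view: one column of peptide abundances per sample.
overview = pd.concat(
    {name: values["peptide"]["abundance"] for name, values in samples.items()},
    axis=1,
)
print(overview.head())
```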

Let's also compute some protein abundances by applying a simple quantification function. In this example we simply average all peptides
belonging to a protein. From a graph perspective, we are averaging over the one-hop neighborhood of each protein.

In [27]:
from pyproteonet.processing.aggregation import neighbor_mean
ds = neighbor_mean(dataset=ds, input_molecule='peptide', input_column='abundance',
                   result_molecule='protein', result_column='abundance',
                   mapping='protein_mapping')

The `neighbor_mean(...)` function was automatically applied to all samples, resulting in a new dataset. To inspect the results
for one sample we can simply index the dataset with the sample name.

In [30]:
sample = ds['sample0']
sample

<pyproteonet.data.dataset_sample.DatasetSample at 0x7f6d3949b890>

This returns a `DatasetSample`. Every `DatasetSample` has a `values` property which holds a dictionary with the
value dataframes for all molecule types. So to get the protein values we can do:

In [33]:
sample.values['protein']

|   | abundance |
|---|-----------|
| 0 | 3578.954072 |
| 1 | 6330.201144 |
| 2 | 3717.998176 |
| 3 | 3873.220958 |
| 4 | 3892.350814 |
| 5 | 5454.61167 |
| 6 | 4936.436778 |
| 7 | 5646.796905 |
| 8 | 4125.000959 |
| 9 | 5995.243607 |
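
To make the one-hop neighborhood averaging described above more concrete, the following pandas-only sketch computes the same kind of per-protein mean for a single toy sample. It is a conceptual equivalent under the toy mapping of this tutorial, not the implementation of `neighbor_mean`. How the library treats missing values or peptides shared between proteins is not covered here.

```python
import numpy as np
import pandas as pd

# Toy peptide abundances and the peptide-to-protein mapping used in this tutorial.
rng = np.random.default_rng(seed=0)  # arbitrary seed; values differ from the output above
peptide_abundance = pd.Series(rng.uniform(size=100) * 10000, index=range(100))
peptide_to_protein = pd.Series([i % 10 for i in range(100)], index=range(100))

# Average the abundances of all peptides that map to the same protein.
protein_abundance = peptide_abundance.groupby(peptide_to_protein).mean()
protein_abundance.index.name = "protein"
print(protein_abundance)
```

Since every protein has exactly ten peptides in this toy setup, each protein abundance is the mean of ten uniformly distributed values.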


This concludes the introduction. Hopefully it gave you an idea of the underlying concepts and main data structures of PyProteoNet.
For an overview of all the properties and methods of `MoleculeSet`, `DatasetSample`, and `Dataset`, have a look at
the API reference.