# Tutorial for Dev: Peptide and Fragment DataFrames

This notebook introduces the functionality for peptide and fragment DataFrames to developers.

## Peptide DataFrame

A peptide dataframe must contain four columns: `sequence` for the amino acid sequence (str), `mods` for modification names (str), `mod_sites` for modification sites (str), and `charge` for the precursor charge state (int).

We can easily build a peptide dataframe:

In [1]:
import pandas as pd

df = pd.DataFrame({
    'sequence': ['ACDEFHIK', 'APDEFMNIK', 'SWDEFMNTIRAAAAKDDDDR'],
    'mods': ['Carbamidomethyl@C', '', 'Phospho@S;Oxidation@M'],
    'mod_sites': ['2', '', '1;6'],
    'charge': [1,2,3],
})
df

,sequence,mods,mod_sites,charge
0,ACDEFHIK,Carbamidomethyl@C,2,1
1,APDEFMNIK,,,2
2,SWDEFMNTIRAAAAKDDDDR,Phospho@S;Oxidation@M,1;6,3
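
The `mods` and `mod_sites` strings are parallel, semicolon-separated lists: the i-th site belongs to the i-th modification, and the sites in this example are 1-based residue positions. As a minimal sketch, they can be paired and sanity-checked as follows. The helpers `pair_mods` and `sites_match_sequence` are hypothetical and not part of AlphaBase, and the check only covers modifications that target a single residue letter.

```python
from typing import List, Tuple


def pair_mods(mods: str, mod_sites: str) -> List[Tuple[int, str]]:
    """Pair each modification name with its site (hypothetical helper)."""
    if not mods:
        return []  # unmodified peptide: both strings are empty
    names = mods.split(";")
    sites = [int(site) for site in mod_sites.split(";")]
    if len(names) != len(sites):
        raise ValueError("mods and mod_sites must have the same number of entries")
    return list(zip(sites, names))


def sites_match_sequence(sequence: str, mods: str, mod_sites: str) -> bool:
    """Check that each single-residue modification sits on the residue it names."""
    for site, name in pair_mods(mods, mod_sites):
        target = name.split("@")[1]
        # Only residue-specific targets such as 'C' or 'M' are checked here.
        if len(target) == 1 and sequence[site - 1] != target:
            return False
    return True


print(pair_mods("Phospho@S;Oxidation@M", "1;6"))
# [(1, 'Phospho@S'), (6, 'Oxidation@M')]
print(sites_match_sequence("SWDEFMNTIRAAAAKDDDDR", "Phospho@S;Oxidation@M", "1;6"))
# True
```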


### Calculate precursor_mz and isotopes from peptide dataframe

`alphabase.peptide.precursor.update_precursor_mz()` calculates `precursor_mz` for each peptide. As the output shows, it also adds the `nAA` column (the peptide length).

In [2]:
from alphabase.peptide.precursor import update_precursor_mz

update_precursor_mz(df)

,sequence,mods,mod_sites,charge,nAA,precursor_mz
0,ACDEFHIK,Carbamidomethyl@C,2,1,8,1019.461492
1,APDEFMNIK,,,2,9,532.757692
2,SWDEFMNTIRAAAAKDDDDR,Phospho@S;Oxidation@M,1;6,3,20,808.337166
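
For orientation, the m/z of an unmodified peptide follows from its monoisotopic mass: sum the residue masses, add one water, add one proton per charge, and divide by the charge. The sketch below checks this for `APDEFMNIK` at charge 2. The mass table, the constants, and the function name `simple_precursor_mz` are assumptions for illustration; this is not AlphaBase's implementation and it ignores modifications.

```python
# Monoisotopic residue masses (Da), only for the amino acids in APDEFMNIK.
RESIDUE_MASS = {
    "A": 71.037114, "P": 97.052764, "D": 115.026943,
    "E": 129.042593, "F": 147.068414, "M": 131.040485,
    "N": 114.042927, "I": 113.084064, "K": 128.094963,
}
WATER_MASS = 18.010565   # Da, added once per peptide
PROTON_MASS = 1.007276   # Da, added once per charge


def simple_precursor_mz(sequence: str, charge: int) -> float:
    """m/z of an unmodified peptide (illustrative helper, not AlphaBase API)."""
    neutral_mass = sum(RESIDUE_MASS[aa] for aa in sequence) + WATER_MASS
    return (neutral_mass + charge * PROTON_MASS) / charge


print(round(simple_precursor_mz("APDEFMNIK", 2), 4))
# 532.7577
```

The result agrees with the `precursor_mz` of the second peptide in the output above.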


`alphabase.peptide.precursor.calc_precursor_isotope()` calculates the precursor isotope information for peptides. It adds the following `isotope_*` columns to the peptide dataframe:

- isotope_m1_intensity: relative intensity of M1 to mono peak
- isotope_m1_mz: mz of M1
- isotope_apex_intensity: relative intensity of the apex peak
- isotope_apex_mz: mz of the apex peak
- isotope_apex_offset: position offset of the apex peak to mono peak
- isotope_right_most_intensity: relative intensity of the right-most peak
- isotope_right_most_mz: mz of the right-most peak
- isotope_right_most_offset: position offset of the right-most peak

In [3]:
from alphabase.peptide.precursor import calc_precursor_isotope

calc_precursor_isotope(df)

,sequence,mods,mod_sites,charge,nAA,precursor_mz,isotope_m1_intensity,isotope_apex_intensity,isotope_apex_offset,isotope_right_most_intensity,isotope_right_most_offset,isotope_m1_mz,isotope_apex_mz,isotope_right_most_mz
0,ACDEFHIK,Carbamidomethyl@C,2,1,8,1019.461492,0.53994,1.0,0,0.214538,2,1020.464792,1019.461492,1021.468092
1,APDEFMNIK,,,2,9,532.757692,0.56992,1.0,0,0.233061,2,533.259342,532.757692,533.760992
2,SWDEFMNTIRAAAAKDDDDR,Phospho@S;Oxidation@M,1;6,3,20,808.337166,1.194615,1.194615,1,0.425938,3,808.671599,808.671599,809.340466
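
The `isotope_*_mz` values in this output are consistent with each isotope peak sitting `offset * 1.0033 / charge` above the monoisotopic `precursor_mz`. The sketch below reproduces two of them for the third peptide. The spacing constant of 1.0033 Da and the function name `isotope_mz` are assumptions inferred from the numbers shown, not taken from the AlphaBase source.

```python
ISOTOPE_MASS_DIFF = 1.0033  # Da between neighboring isotope peaks (assumed)


def isotope_mz(precursor_mz: float, charge: int, offset: int) -> float:
    """m/z of the isotope peak `offset` positions above the mono peak."""
    return precursor_mz + offset * ISOTOPE_MASS_DIFF / charge


# Third peptide: charge 3, apex offset 1, right-most offset 3.
print(round(isotope_mz(808.337166, 3, 1), 4))  # 808.6716
print(round(isotope_mz(808.337166, 3, 3), 4))  # 809.3405
```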


> Computing isotope patterns for millions of peptides is very time-consuming, so AlphaBase also provides `calc_precursor_isotope_mp`, which uses multiprocessing.

## Fragment DataFrame

`alphabase.peptide.fragment.create_fragment_mz_dataframe()` is the only function needed to calculate the fragment m/z dataframe. It has two key parameters:

- precursor_df (pd.DataFrame): the peptide or precursor dataframe.
- charged_frag_types (list of str): the charged fragment types to include as columns of the fragment dataframe. The schema is `Type[_LossType]_z[n]`, where
  - `Type` can be `a,b,c,x,y,z`.
  - `_LossType` can be `_modloss,_H2O,_NH3`; this part is optional.
  - `z[n]` is the charge state. If the precursor charge is less than `n`, the corresponding m/z is set to zero.
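
The `Type[_LossType]_z[n]` schema can be expressed as a regular expression. The following is a minimal sketch for splitting a column name into its parts; `parse_charged_frag_type` and `ChargedFragType` are hypothetical names, not AlphaBase API, and the accepted fragment letters (a, b, c, x, y, z) are the ones used in this notebook's example.

```python
import re
from typing import NamedTuple, Optional

# Type, optional loss type, then the charge state after "_z".
FRAG_TYPE_PATTERN = re.compile(
    r"^(?P<type>[abcxyz])(?:_(?P<loss>modloss|H2O|NH3))?_z(?P<charge>\d+)$"
)


class ChargedFragType(NamedTuple):
    frag_type: str
    loss_type: Optional[str]  # None when no loss is specified
    charge: int


def parse_charged_frag_type(name: str) -> ChargedFragType:
    """Split a name such as 'y_H2O_z1' into type, loss, and charge."""
    match = FRAG_TYPE_PATTERN.match(name)
    if match is None:
        raise ValueError(f"not a valid charged fragment type: {name!r}")
    return ChargedFragType(match["type"], match["loss"], int(match["charge"]))


print(parse_charged_frag_type("y_H2O_z1"))
# ChargedFragType(frag_type='y', loss_type='H2O', charge=1)
print(parse_charged_frag_type("b_z2"))
# ChargedFragType(frag_type='b', loss_type=None, charge=2)
```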

In [4]:
from alphabase.peptide.fragment import create_fragment_mz_dataframe
frag_mz_df = create_fragment_mz_dataframe(
    df,
    charged_frag_types=['a_z1','b_z1','c_z1','b_z2','x_z1','y_z1', 'y_H2O_z1','z_z1']
)
frag_mz_df

,a_z1,b_z1,c_z1,b_z2,x_z1,y_z1,y_H2O_z1,z_z1
0,44.049476,72.04439,89.070939,0.0,974.403643,948.424379,930.413814,932.405655
1,204.080124,232.075039,249.101588,0.0,814.372995,788.39373,770.383165,772.375006
2,319.107067,347.101982,364.128531,0.0,699.346052,673.366787,655.356222,657.348063
3,448.14966,476.144575,493.171124,0.0,570.303458,544.324194,526.313629,528.30547
4,595.218074,623.212989,640.239538,0.0,423.235045,397.25578,379.245215,381.237056
5,732.276986,760.271901,777.29845,0.0,286.176133,260.196868,242.186303,244.178144
6,845.36105,873.355965,890.382514,0.0,173.092069,147.112804,129.102239,131.09408
7,44.049476,72.04439,89.070939,36.525833,1019.450259,993.470995,975.46043,977.45227
8,141.102239,169.097154,186.123703,85.052215,922.397495,896.418231,878.407666,880.399507
9,256.129183,284.124097,301.150646,142.565687,807.370552,781.391288,763.380723,765.372564


After `create_fragment_mz_dataframe()` is called, two columns, `frag_start_idx` and `frag_stop_idx`, are appended to the peptide dataframe. For each peptide, these two values give the row range of its fragments in the fragment dataframe.

In [5]:
df[[
    'sequence','mods','mod_sites','charge','nAA',
    'precursor_mz','frag_start_idx','frag_stop_idx'
]]

,sequence,mods,mod_sites,charge,nAA,precursor_mz,frag_start_idx,frag_stop_idx
0,ACDEFHIK,Carbamidomethyl@C,2,1,8,1019.461492,0,7
1,APDEFMNIK,,,2,9,532.757692,7,15
2,SWDEFMNTIRAAAAKDDDDR,Phospho@S;Oxidation@M,1;6,3,20,808.337166,15,34
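
The indices in this output are consistent with each peptide contributing `nAA - 1` fragment rows (one per peptide bond), stored back to back in peptide order, with `frag_stop_idx` exclusive. A minimal sketch that reproduces them from `nAA` alone follows; the helper `frag_index_ranges` is hypothetical and only illustrates the layout.

```python
from itertools import accumulate
from typing import List, Tuple


def frag_index_ranges(naa_list: List[int]) -> List[Tuple[int, int]]:
    """Return (start, stop) row ranges, assuming nAA - 1 rows per peptide."""
    n_frags = [naa - 1 for naa in naa_list]
    stops = list(accumulate(n_frags))   # running total = exclusive stop index
    starts = [0] + stops[:-1]           # each peptide starts where the last stopped
    return list(zip(starts, stops))


print(frag_index_ranges([8, 9, 20]))
# [(0, 7), (7, 15), (15, 34)]
```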


In [6]:
start, stop = df[['frag_start_idx','frag_stop_idx']].values[0]  # first peptide
frag_mz_df.iloc[start:stop]

,a_z1,b_z1,c_z1,b_z2,x_z1,y_z1,y_H2O_z1,z_z1
0,44.049476,72.04439,89.070939,0.0,974.403643,948.424379,930.413814,932.405655
1,204.080124,232.075039,249.101588,0.0,814.372995,788.39373,770.383165,772.375006
2,319.107067,347.101982,364.128531,0.0,699.346052,673.366787,655.356222,657.348063
3,448.14966,476.144575,493.171124,0.0,570.303458,544.324194,526.313629,528.30547
4,595.218074,623.212989,640.239538,0.0,423.235045,397.25578,379.245215,381.237056
5,732.276986,760.271901,777.29845,0.0,286.176133,260.196868,242.186303,244.178144
6,845.36105,873.355965,890.382514,0.0,173.092069,147.112804,129.102239,131.09408


Note that all N-term (a/b/c) fragment mz values are in ascending order, e.g. from b[1] to b[n-1]; and all C-term (x/y/z) fragments are in descending order, e.g. from y[n-1] to y[1].
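
To make this ordering concrete, the sketch below recomputes the singly charged b and y ions of the first peptide (`ACDEFHIK` with Carbamidomethyl@C at site 2) from residue masses. The mass table, the carbamidomethyl mass of 57.021464 Da, and the function name `by_ions_z1` are assumptions for illustration; this is not AlphaBase's implementation.

```python
from itertools import accumulate
from typing import Dict, List, Optional, Tuple

# Monoisotopic residue masses (Da), only for the amino acids in ACDEFHIK.
RESIDUE_MASS = {
    "A": 71.037114, "C": 103.009185, "D": 115.026943, "E": 129.042593,
    "F": 147.068414, "H": 137.058912, "I": 113.084064, "K": 128.094963,
}
WATER_MASS = 18.010565
PROTON_MASS = 1.007276


def by_ions_z1(
    sequence: str, mod_masses: Optional[Dict[int, float]] = None
) -> Tuple[List[float], List[float]]:
    """Return (b_z1, y_z1) lists; mod_masses maps 1-based site -> mass shift."""
    mod_masses = mod_masses or {}
    masses = [
        RESIDUE_MASS[aa] + mod_masses.get(site, 0.0)
        for site, aa in enumerate(sequence, start=1)
    ]
    prefix = list(accumulate(masses))  # prefix[i-1] = mass of the first i residues
    total = prefix[-1]
    # Row i describes one cleavage site: b[i+1] and its complementary y ion.
    b_ions = [m + PROTON_MASS for m in prefix[:-1]]                       # b[1]..b[n-1]
    y_ions = [total - m + WATER_MASS + PROTON_MASS for m in prefix[:-1]]  # y[n-1]..y[1]
    return b_ions, y_ions


b_ions, y_ions = by_ions_z1("ACDEFHIK", {2: 57.021464})
print([round(mz, 3) for mz in b_ions])  # ascending
print([round(mz, 3) for mz in y_ions])  # descending
```

Within rounding, the two lists agree with the `b_z1` and `y_z1` columns shown for the first peptide, and each row pairs a b ion with the complementary y ion from the same cleavage site.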

All dataframe functionalities use low-level APIs of AlphaBase; see `tutorial_dev_basic_definations.ipynb` or `Tutorial for Dev: Basic Definations`.

Spectral library functionalities provide higher-level APIs that encapsulate these dataframe functionalities; see `tutorial_dev_spectral_libraries.ipynb` or `Tutorial for Dev: Spectral Libraries`.