# Tutorial for Dev: Peptide and Fragment DataFrames

This notebook introduces functionalities for peptide and fragment DataFrames to developers.

## Peptide DataFrame

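A peptide dataframe must contain four columns: `sequence` for the amino acid sequence (str), `mods` for modification names (str), `mod_sites` for modification sites (str), and `charge` for the precursor charge state (int). As the example below shows, multiple modifications on one peptide are separated by `;`, with `mods` and `mod_sites` listed in matching order.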

We can easily build a peptide dataframe:

In [1]:
import pandas as pd

df = pd.DataFrame({
    'sequence': ['ACDEFHIK', 'APDEFMNIK', 'SWDEFMNTIRAAAAKDDDDR'],
    'mods': ['Carbamidomethyl@C', '', 'Phospho@S;Oxidation@M'],
    'mod_sites': ['2', '', '1;6'],
    'charge': [1,2,3],
})
df

|   | sequence | mods | mod_sites | charge |
|---|---|---|---|---|
| 0 | ACDEFHIK | Carbamidomethyl@C | 2 | 1 |
| 1 | APDEFMNIK |  |  | 2 |
| 2 | SWDEFMNTIRAAAAKDDDDR | Phospho@S;Oxidation@M | 1;6 | 3 |
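
The column requirements above can be checked before handing data to AlphaBase. The following is a minimal sketch, not part of AlphaBase: `check_peptide_columns` is a hypothetical helper, and it assumes that `mods` and `mod_sites` are `;`-separated lists that pair one-to-one, as in the example above. It works on the plain dict of columns; for an existing dataframe, `df.to_dict('list')` gives the same structure.

```python
REQUIRED_COLUMNS = ("sequence", "mods", "mod_sites", "charge")


def check_peptide_columns(columns):
    """Return a list of problems; an empty list means the columns look consistent.

    Hypothetical helper for illustration only (not an AlphaBase function).
    """
    problems = [
        f"missing column: {name}" for name in REQUIRED_COLUMNS if name not in columns
    ]
    if problems:
        return problems

    rows = zip(
        columns["sequence"], columns["mods"], columns["mod_sites"], columns["charge"]
    )
    for i, (_sequence, mods, sites, charge) in enumerate(rows):
        # Assumption: an empty string means "no modification".
        mod_list = mods.split(";") if mods else []
        site_list = sites.split(";") if sites else []
        if len(mod_list) != len(site_list):
            problems.append(
                f"row {i}: {len(mod_list)} mods but {len(site_list)} mod_sites"
            )
        if int(charge) < 1:
            problems.append(f"row {i}: charge must be a positive integer")
    return problems


# Same columns as the dataframe built above.
peptides = {
    "sequence": ["ACDEFHIK", "APDEFMNIK", "SWDEFMNTIRAAAAKDDDDR"],
    "mods": ["Carbamidomethyl@C", "", "Phospho@S;Oxidation@M"],
    "mod_sites": ["2", "", "1;6"],
    "charge": [1, 2, 3],
}
print(check_peptide_columns(peptides))  # no problems expected for this input
```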


### Calculate precursor_mz and isotopes from peptide dataframe

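`alphabase.peptide.precursor.update_precursor_mz()` calculates the precursor m/z for each peptide. As the output shows, it adds the `nAA` (number of amino acids) and `precursor_mz` columns to the dataframe.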

In [2]:
from alphabase.peptide.precursor import update_precursor_mz

update_precursor_mz(df)

|   | sequence | mods | mod_sites | charge | nAA | precursor_mz |
|---|---|---|---|---|---|---|
| 0 | ACDEFHIK | Carbamidomethyl@C | 2 | 1 | 8 | 1019.461492 |
| 1 | APDEFMNIK |  |  | 2 | 9 | 532.757692 |
| 2 | SWDEFMNTIRAAAAKDDDDR | Phospho@S;Oxidation@M | 1;6 | 3 | 20 | 808.337166 |
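
To make the `precursor_mz` values less of a black box, here is a rough sketch of the underlying arithmetic: sum the residue masses, add one water and any modification mass, then add `charge` protons and divide by `charge`. This is not AlphaBase's implementation. `precursor_mz` below is a hypothetical function, and the monoisotopic masses are standard values hard-coded for the residues in the first two peptides only.

```python
PROTON = 1.007276   # mass of a proton (Da)
WATER = 18.010565   # monoisotopic mass of H2O (Da)
CARBAMIDOMETHYL = 57.021464  # monoisotopic mass shift of Carbamidomethyl (Da)

# Monoisotopic residue masses (Da); only the residues needed here.
RESIDUE = {
    "A": 71.037114, "C": 103.009185, "D": 115.026943, "E": 129.042593,
    "F": 147.068414, "H": 137.058912, "I": 113.084064, "K": 128.094963,
    "M": 131.040485, "N": 114.042927, "P": 97.052764,
}


def precursor_mz(sequence, charge, mod_mass=0.0):
    """m/z of the protonated peptide; `mod_mass` is the summed modification mass."""
    neutral_mass = sum(RESIDUE[aa] for aa in sequence) + WATER + mod_mass
    return (neutral_mass + charge * PROTON) / charge


mz_1 = precursor_mz("ACDEFHIK", charge=1, mod_mass=CARBAMIDOMETHYL)
mz_2 = precursor_mz("APDEFMNIK", charge=2)
# Compare with the precursor_mz column above (1019.461492 and 532.757692).
print(f"{mz_1:.4f} {mz_2:.4f}")
```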


`alphabase.peptide.precursor.calc_precursor_isotope()` calculates the precursor isotope information for peptides. It adds the `i_*` columns (isotope intensities) and the `mono_isotope_idx` column to the dataframe.

In [3]:
from alphabase.peptide.precursor import calc_precursor_isotope

calc_precursor_isotope(df)

|   | sequence | mods | mod_sites | charge | nAA | precursor_mz | i_0 | i_1 | i_2 | i_3 | i_4 | i_5 | mono_isotope_idx |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ACDEFHIK | Carbamidomethyl@C | 2 | 1 | 8 | 1019.461492 | 0.54489 | 0.294208 | 0.1169 | 0.03434 | 0.008077 | 0.001584 | 0 |
| 1 | APDEFMNIK |  |  | 2 | 9 | 532.757692 | 0.527839 | 0.300826 | 0.123018 | 0.037359 | 0.009104 | 0.001854 | 0 |
| 2 | SWDEFMNTIRAAAAKDDDDR | Phospho@S;Oxidation@M | 1;6 | 3 | 20 | 808.337166 | 0.271028 | 0.323775 | 0.225641 | 0.115441 | 0.047553 | 0.016561 | 0 |
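
A short sketch of how the `i_*` columns can be read, using the values copied from the output above. It finds the most intense isotope peak per peptide and sums the six intensities. That the intensities sum to roughly 1 is an observation from this output, not a documented guarantee.

```python
# i_0 .. i_5 for each peptide, copied from the table above.
isotope_rows = {
    "ACDEFHIK": [0.54489, 0.294208, 0.1169, 0.03434, 0.008077, 0.001584],
    "APDEFMNIK": [0.527839, 0.300826, 0.123018, 0.037359, 0.009104, 0.001854],
    "SWDEFMNTIRAAAAKDDDDR": [0.271028, 0.323775, 0.225641, 0.115441, 0.047553, 0.016561],
}

apex_idx = {}
totals = {}
for sequence, intensities in isotope_rows.items():
    # Index of the most intense isotope peak (0 = monoisotopic peak here).
    apex_idx[sequence] = max(range(len(intensities)), key=lambda i: intensities[i])
    totals[sequence] = sum(intensities)

for sequence in isotope_rows:
    print(sequence, apex_idx[sequence], round(totals[sequence], 3))
```

For the two short peptides the monoisotopic peak (`i_0`) is the most intense, while for the 20-residue peptide the apex moves to `i_1`.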


> Computing isotope patterns is very time-consuming for millions of peptides, so we provide `calc_precursor_isotope_mp`, which uses multiprocessing.

## Fragment DataFrame

`alphabase.peptide.fragment.create_fragment_mz_dataframe()` is the only function we need to calculate the fragment m/z dataframe. It has two key parameters:

- `precursor_df` (pd.DataFrame): the peptide or precursor dataframe.
- `charged_frag_types` (list of str): the charged fragment types to include as columns of the fragment dataframe. The schema is `Type[_LossType]_z[n]`, where
  - `Type` can be `a,b,c,x,y,z`
  - `_LossType` is optional and can be `_modloss,_H2O,_NH3`
  - `z[n]` is the charge state. If the precursor charge is less than `n`, the corresponding m/z is set to zero.
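
The `Type[_LossType]_z[n]` schema can be illustrated with a small parser. This is only a sketch: `parse_charged_frag_type` is a hypothetical helper, not an AlphaBase function, and the accepted ion types and loss types are the ones named in this tutorial.

```python
import re

# Type, optional _LossType, then _z followed by the charge state.
FRAG_PATTERN = re.compile(
    r"^(?P<type>[abcxyz])(?:_(?P<loss>modloss|H2O|NH3))?_z(?P<charge>\d+)$"
)


def parse_charged_frag_type(name):
    """Split a column name such as 'y_H2O_z1' into (type, loss, charge)."""
    match = FRAG_PATTERN.match(name)
    if match is None:
        raise ValueError(f"not a valid charged fragment type: {name!r}")
    # `loss` is None when the optional _LossType part is absent.
    return match["type"], match["loss"], int(match["charge"])


for name in ["b_z1", "b_z2", "y_H2O_z1", "b_modloss_z1"]:
    print(name, parse_charged_frag_type(name))
```

Parsing the names up front makes it easy to catch typos such as `y_z` or `b-H2O_z1` before building a large fragment dataframe.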

In [4]:
from alphabase.peptide.fragment import create_fragment_mz_dataframe
frag_mz_df = create_fragment_mz_dataframe(
    df,
    charged_frag_types=['a_z1','b_z1','c_z1','b_z2','x_z1','y_z1', 'y_H2O_z1','z_z1']
)
frag_mz_df

|   | a_z1 | b_z1 | c_z1 | b_z2 | x_z1 | y_z1 | y_H2O_z1 | z_z1 |
|---|---|---|---|---|---|---|---|---|
| 0 | 44.049477 | 72.044388 | 89.070938 | 0.0 | 974.403625 | 948.424377 | 930.413818 | 932.40564 |
| 1 | 204.080124 | 232.075043 | 249.101593 | 0.0 | 814.372986 | 788.393738 | 770.383179 | 772.375 |
| 2 | 319.107056 | 347.10199 | 364.12854 | 0.0 | 699.346069 | 673.36676 | 655.356201 | 657.348083 |
| 3 | 448.149658 | 476.144562 | 493.171112 | 0.0 | 570.303467 | 544.324219 | 526.31366 | 528.305481 |
| 4 | 595.218079 | 623.213013 | 640.239563 | 0.0 | 423.235046 | 397.255768 | 379.245209 | 381.237061 |
| 5 | 732.276978 | 760.271912 | 777.298462 | 0.0 | 286.176147 | 260.196869 | 242.18631 | 244.178146 |
| 6 | 845.361023 | 873.355957 | 890.382507 | 0.0 | 173.092072 | 147.112808 | 129.102234 | 131.094086 |
| 7 | 44.049477 | 72.044388 | 89.070938 | 36.525833 | 1019.450256 | 993.471008 | 975.460449 | 977.452271 |
| 8 | 141.102234 | 169.097153 | 186.123703 | 85.052216 | 922.397522 | 896.418213 | 878.407654 | 880.399536 |
| 9 | 256.129181 | 284.124084 | 301.150635 | 142.565689 | 807.370544 | 781.391296 | 763.380737 | 765.372559 |


After `create_fragment_mz_dataframe()` is called, two columns, `frag_start_idx` and `frag_stop_idx`, are appended to the peptide dataframe. For each peptide, these two values give the start (inclusive) and stop (exclusive) rows of its fragments in the fragment dataframe.

In [5]:
df[[
    'sequence','mods','mod_sites','charge','nAA',
    'precursor_mz','frag_start_idx','frag_stop_idx'
]]

|   | sequence | mods | mod_sites | charge | nAA | precursor_mz | frag_start_idx | frag_stop_idx |
|---|---|---|---|---|---|---|---|---|
| 0 | ACDEFHIK | Carbamidomethyl@C | 2 | 1 | 8 | 1019.461492 | 0 | 7 |
| 1 | APDEFMNIK |  |  | 2 | 9 | 532.757692 | 7 | 15 |
| 2 | SWDEFMNTIRAAAAKDDDDR | Phospho@S;Oxidation@M | 1;6 | 3 | 20 | 808.337166 | 15 | 34 |
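
The index values in this output follow a simple pattern: a peptide with `nAA` residues occupies `nAA - 1` rows, and the blocks are stacked one after another. The sketch below reproduces the two columns from `nAA` alone. It assumes the fragments are laid out contiguously in the row order of the peptide dataframe, which matches this output but is not a statement about how AlphaBase orders fragments in general.

```python
from itertools import accumulate

n_aa = [8, 9, 20]  # nAA column from the table above

# A peptide with n residues has n - 1 backbone cleavage positions.
frag_counts = [n - 1 for n in n_aa]

# Stop indices are the running totals; each start is the previous stop.
frag_stop_idx = list(accumulate(frag_counts))
frag_start_idx = [0] + frag_stop_idx[:-1]

for start, stop in zip(frag_start_idx, frag_stop_idx):
    print(start, stop)
```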


In [6]:
start, stop = df[['frag_start_idx', 'frag_stop_idx']].values[0]  # first peptide
frag_mz_df.iloc[start:stop]

|   | a_z1 | b_z1 | c_z1 | b_z2 | x_z1 | y_z1 | y_H2O_z1 | z_z1 |
|---|---|---|---|---|---|---|---|---|
| 0 | 44.049477 | 72.044388 | 89.070938 | 0.0 | 974.403625 | 948.424377 | 930.413818 | 932.40564 |
| 1 | 204.080124 | 232.075043 | 249.101593 | 0.0 | 814.372986 | 788.393738 | 770.383179 | 772.375 |
| 2 | 319.107056 | 347.10199 | 364.12854 | 0.0 | 699.346069 | 673.36676 | 655.356201 | 657.348083 |
| 3 | 448.149658 | 476.144562 | 493.171112 | 0.0 | 570.303467 | 544.324219 | 526.31366 | 528.305481 |
| 4 | 595.218079 | 623.213013 | 640.239563 | 0.0 | 423.235046 | 397.255768 | 379.245209 | 381.237061 |
| 5 | 732.276978 | 760.271912 | 777.298462 | 0.0 | 286.176147 | 260.196869 | 242.18631 | 244.178146 |
| 6 | 845.361023 | 873.355957 | 890.382507 | 0.0 | 173.092072 | 147.112808 | 129.102234 | 131.094086 |


Note that all N-term (a/b/c) fragment mz values are in ascending order, e.g. from b[1] to b[n-1]; and all C-term (x/y/z) fragments are in descending order, e.g. from y[n-1] to y[1].
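
The row ordering described above can be sketched with a plain b/y ladder calculation for the unmodified peptide `APDEFMNIK`. This is an illustrative sketch, not AlphaBase's implementation: `by_ladders` is a hypothetical function and the monoisotopic residue masses are standard values hard-coded for this one peptide.

```python
PROTON = 1.007276   # mass of a proton (Da)
WATER = 18.010565   # monoisotopic mass of H2O (Da)

# Monoisotopic residue masses (Da) for the residues of APDEFMNIK.
RESIDUE = {
    "A": 71.037114, "P": 97.052764, "D": 115.026943, "E": 129.042593,
    "F": 147.068414, "M": 131.040485, "N": 114.042927, "I": 113.084064,
    "K": 128.094963,
}


def by_ladders(sequence, charge=1):
    """Return (b_mz, y_mz), one entry per cleavage position.

    Row i holds b[i+1] and the complementary y[n-1-i], so b values ascend
    (b[1] .. b[n-1]) while y values descend (y[n-1] .. y[1]).
    """
    masses = [RESIDUE[aa] for aa in sequence]
    total = sum(masses)
    b_mz, y_mz = [], []
    prefix = 0.0
    for mass in masses[:-1]:
        prefix += mass
        b_mz.append((prefix + charge * PROTON) / charge)
        y_mz.append((total - prefix + WATER + charge * PROTON) / charge)
    return b_mz, y_mz


b_mz, y_mz = by_ladders("APDEFMNIK")
# Compare the first row with b_z1 and y_z1 in row 7 of frag_mz_df above.
print(f"{b_mz[0]:.4f} {y_mz[0]:.4f}")
```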

All dataframe functionalities use the low-level APIs of AlphaBase; see `tutorial_dev_basic_definations.ipynb` or `Tutorial for Dev: Basic Definations`.

Spectral library functionalities provide higher-level APIs that encapsulate these dataframe functionalities; see `tutorial_dev_spectral_libraries.ipynb` or `Tutorial for Dev: Spectral Libraries`.