In [None]:
%reload_ext autoreload
%autoreload 2

## The constants module (alphabase.constants)

The `alphabase.constants` module contains chemical element (`alphabase.constants.element`), amino acid (`alphabase.constants.aa`), and modification (`alphabase.constants.modification`) information, which underlies all other alphabase functionality. See `alphabase/constants/nist_element.yaml`, `alphabase/constants/amino_acid.yaml`, and `alphabase/constants/used_mod.yaml` for details.


## The peptide module (alphabase.peptide)

The `peptide` module contains peptide-related functionalities.

### alphabase.peptide.precursor

The main functions in `alphabase.peptide.precursor` are:

- `refine_precursor_df(df)`: sorts `df` by `nAA` and calls `reset_index()` in place, which speeds up precursor and fragment mass calculation.
- `update_precursor_mz(df)`/`calc_precursor_mz(df)`: calculates the `precursor_mz` column. The dataframe must contain the `sequence`, `mods`, `mod_sites`, and `charge` columns.
- `hash_precursor_df(df, seed=0)`: generates an int64-based unique identity for each modified peptide (`sequence`, `mods`, and `mod_sites`) in the `mod_seq_hash` column. It also generates an int64-based unique identity for each precursor (`sequence`, `mods`, `mod_sites`, and `charge`) in the `mod_seq_charge_hash` column.
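
The precursor m/z calculation itself can be illustrated with a minimal sketch in plain Python. The lookup tables and the `precursor_mz()` helper below are hypothetical stand-ins, not alphabase's API. The masses are standard monoisotopic values rounded to six decimals, and mod sites are ignored because they do not change the precursor mass.

```python
# Monoisotopic masses in Da (standard values, rounded; hypothetical lookup tables).
RESIDUE_MASS = {
    "A": 71.037114, "C": 103.009185, "D": 115.026943, "E": 129.042593,
    "F": 147.068414, "H": 137.058912, "K": 128.094963, "M": 131.040485,
}
MOD_MASS = {
    "Acetyl@Protein N-term": 42.010565,
    "Carbamidomethyl@C": 57.021464,
    "Oxidation@M": 15.994915,
}
WATER_MASS = 18.010565
PROTON_MASS = 1.00727646688


def precursor_mz(sequence: str, mods: str, charge: int) -> float:
    """Return (neutral mass + charge * proton mass) / charge."""
    neutral_mass = sum(RESIDUE_MASS[aa] for aa in sequence) + WATER_MASS
    # `mods` uses the same ';'-separated format as the `mods` column.
    neutral_mass += sum(MOD_MASS[mod] for mod in mods.split(";") if mod)
    return neutral_mass / charge + PROTON_MASS


rows = [
    ("ACDEFHMK", "", 1),
    ("ACDEFHMK", "Acetyl@Protein N-term", 1),
    ("ACDEFHMK", "Carbamidomethyl@C;Oxidation@M", 2),
]
mz_values = [precursor_mz(*row) for row in rows]
for row, mz in zip(rows, mz_values):
    print(row, round(mz, 4))
```

With these masses the sketch agrees, to within rounding, with the `precursor_mz` column in the output of the following cell.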

In [None]:
import alphabase.peptide.precursor as precursor_utils
import pandas as pd

df = pd.DataFrame()
df['sequence'] = ['ACDEFHMK']*3
df['mods'] = ['',
    'Acetyl@Protein N-term', 
    'Carbamidomethyl@C;Oxidation@M'
]
df['mod_sites'] = ['','0','2;7']
df['charge'] = [1,1,2]

precursor_utils.update_precursor_mz(df)
precursor_utils.hash_precursor_df(df)
df

Unnamed: 0,sequence,mods,mod_sites,charge,nAA,precursor_mz,mod_seq_hash,mod_seq_charge_hash
0,ACDEFHMK,,,1,8,980.39645,-3613625140520963925,-3613625140520963924
1,ACDEFHMK,Acetyl@Protein N-term,0,1,8,1022.407015,-4469340833947450,-4469340833947449
2,ACDEFHMK,Carbamidomethyl@C;Oxidation@M,2;7,2,8,527.210052,4151334950393863927,4151334950393863929
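
In the output above, each `mod_seq_charge_hash` equals the corresponding `mod_seq_hash` plus the charge. The sketch below illustrates that idea with the standard library only. The choice of `hashlib.blake2b`, the `|`-joined key, and all function names are assumptions made for illustration; the hash function alphabase actually uses is not shown in this document, so the values will differ from those in the table.

```python
import hashlib

INT64_HALF = 2 ** 63
INT64_MOD = 2 ** 64


def to_int64(value: int) -> int:
    """Wrap an arbitrary Python integer into the signed 64-bit range."""
    return (value + INT64_HALF) % INT64_MOD - INT64_HALF


def hash_text(text: str, seed: int = 0) -> int:
    """Hash a string to a signed 64-bit integer (hypothetical helper)."""
    salt = seed.to_bytes(8, "little")  # non-negative seeds only
    digest = hashlib.blake2b(
        text.encode("utf-8"), digest_size=8, salt=salt
    ).digest()
    return int.from_bytes(digest, "little", signed=True)


def mod_seq_hash(sequence: str, mods: str, mod_sites: str, seed: int = 0) -> int:
    """Identity of a modified peptide, independent of charge."""
    return hash_text(f"{sequence}|{mods}|{mod_sites}", seed)


def mod_seq_charge_hash(
    sequence: str, mods: str, mod_sites: str, charge: int, seed: int = 0
) -> int:
    """Identity of a precursor: peptide hash shifted by the charge."""
    return to_int64(mod_seq_hash(sequence, mods, mod_sites, seed) + charge)


plain = mod_seq_hash("ACDEFHMK", "", "")
oxidized = mod_seq_hash("ACDEFHMK", "Oxidation@M", "7")
plain_2plus = mod_seq_charge_hash("ACDEFHMK", "", "", charge=2)
print(plain, oxidized, plain_2plus)
```

Keeping the peptide hash and the precursor hash related by a simple offset means precursors of the same modified peptide can still be grouped by `mod_seq_hash`.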


### alphabase.peptide.mobility

The `mobility` module converts CCS values into mobility values and vice versa.

In [None]:
import alphabase.peptide.mobility as mobility
df = pd.DataFrame()
df['sequence'] = ['ACDEFHMK']*3
df['mods'] = ['',
    'Acetyl@Protein N-term', 
    'Carbamidomethyl@C;Oxidation@M'
]
df['mod_sites'] = ['','0','2;7']
df['charge'] = [1,1,2]
df['ccs_pred'] = 300 # if we predict ccs values
df['mobility_whatever'] = 1.0

df['mobility_pred'] = mobility.ccs_to_mobility_for_df(
    df, 'ccs_pred'
)
df['ccs_whatever'] = mobility.mobility_to_ccs_for_df(
    df, 'mobility_whatever'
)
df

Unnamed: 0,sequence,mods,mod_sites,charge,ccs_pred,mobility_whatever,nAA,precursor_mz,mobility_pred,ccs_whatever
0,ACDEFHMK,,,1,300,1.0,8,980.39645,1.477183,203.089245
1,ACDEFHMK,Acetyl@Protein N-term,0,1,300,1.0,8,1022.407015,1.478027,202.973356
2,ACDEFHMK,Carbamidomethyl@C;Oxidation@M,2;7,2,300,1.0,8,527.210052,0.739312,405.78241
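
The numbers above are consistent with a Mason-Schamp-style relation in which CCS is proportional to mobility times charge, divided by the square root of the reduced mass of the ion and the drift gas. The sketch below is a plain-Python reconstruction, not alphabase's code. The coefficient `1059.62245`, the gas mass `28.0` (N2), and the function names are assumptions chosen because they reproduce the table values.

```python
import math

CCS_IM_COEF = 1059.62245  # assumed conversion coefficient
IM_GAS_MASS = 28.0        # assumed drift gas mass (N2), in Da


def reduced_mass(precursor_mz: float, charge: int) -> float:
    """Reduced mass of the ion (approximated as m/z * charge) and the gas."""
    ion_mass = precursor_mz * charge
    return ion_mass * IM_GAS_MASS / (ion_mass + IM_GAS_MASS)


def mobility_to_ccs(mobility: float, charge: int, precursor_mz: float) -> float:
    """Convert a mobility value to CCS."""
    mu = reduced_mass(precursor_mz, charge)
    return mobility * charge * CCS_IM_COEF / math.sqrt(mu)


def ccs_to_mobility(ccs: float, charge: int, precursor_mz: float) -> float:
    """Convert CCS to a mobility value (inverse of mobility_to_ccs)."""
    mu = reduced_mass(precursor_mz, charge)
    return ccs * math.sqrt(mu) / (charge * CCS_IM_COEF)


# (charge, precursor_mz) pairs taken from the table above.
precursors = [(1, 980.39645), (1, 1022.407015), (2, 527.210052)]
mobility_pred = [ccs_to_mobility(300.0, z, mz) for z, mz in precursors]
ccs_from_unit_mobility = [mobility_to_ccs(1.0, z, mz) for z, mz in precursors]
print(mobility_pred)
print(ccs_from_unit_mobility)
```

Because the relation is linear in charge, the doubly charged precursor gets roughly half the mobility for the same CCS, as in the last row of the table.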


### alphabase.peptide.fragment

This is the most comprehensive module in alphabase. Its most important function is `create_fragment_mz_dataframe()`, which creates the fragment dataframe for the given `precursor_df` and inserts the `frag_start_idx` and `frag_end_idx` columns into `precursor_df`.

In [None]:
import alphabase.peptide.fragment as fragment

df = pd.DataFrame()
df['sequence'] = ['ACDEFHMK']*3
df['mods'] = ['',
    'Acetyl@Protein N-term', 
    'Carbamidomethyl@C;Oxidation@M'
]
df['mod_sites'] = ['','0','2;7']
df['charge'] = [1,1,2]
frag_types = fragment.get_charged_frag_types(['b','y'], 2)

frag_df = fragment.create_fragment_mz_dataframe(
    df, frag_types
)
assert 'frag_start_idx' in df.columns and 'frag_end_idx' in df.columns
frag_df

Unnamed: 0,b_z1,b_z2,y_z1,y_z2
0,72.04439,36.525833,909.359336,455.183306
1,175.053575,88.030426,806.350151,403.678714
2,290.080518,145.543897,691.323208,346.165242
3,419.123111,210.065194,562.280615,281.643946
4,566.191525,283.599401,415.212201,208.109739
5,703.250437,352.128857,278.153289,139.580283
6,834.290922,417.649099,147.112804,74.06004
7,114.054955,57.531116,909.359336,455.183306
8,217.06414,109.035708,806.350151,403.678714
9,332.091083,166.54918,691.323208,346.165242
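
The fragment dataframe is flat: the rows of all precursors are stacked, and `frag_start_idx`/`frag_end_idx` mark each precursor's slice. The sketch below shows how such indices could be derived, assuming one row per cleavage site (`nAA - 1` rows per precursor) and a half-open `[start, end)` convention. Plain lists stand in for the dataframes, and the variable names are illustrative only.

```python
# nAA of the three example precursors (all 'ACDEFHMK', 8 residues).
n_aa_list = [8, 8, 8]

frag_start_idx = []
frag_end_idx = []
offset = 0
for n_aa in n_aa_list:
    frag_start_idx.append(offset)
    offset += n_aa - 1  # a peptide with nAA residues has nAA - 1 cleavage sites
    frag_end_idx.append(offset)

# Stand-in for the flat fragment table: just the row numbers.
fragment_rows = list(range(offset))

# Rows belonging to the second precursor (index 1).
second_precursor_rows = fragment_rows[frag_start_idx[1]:frag_end_idx[1]]
print(frag_start_idx, frag_end_idx, second_precursor_rows)
```

With pandas the same slice would be taken with `frag_df.iloc[start:end]`. This is why row 7 of the output above is the first fragment row of the second precursor.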


Note that all fragment ions are stored from the peptide N-term to the C-term, so the b-ions are in ascending order of m/z and the y-ions are in descending order.
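
This ordering follows from computing both ion series per cleavage site while walking from the N-term to the C-term. A minimal sketch for an unmodified peptide is given below. The `by_fragment_mz()` helper and the mass table are hypothetical and use standard monoisotopic masses rounded to six decimals; this is not alphabase's implementation.

```python
RESIDUE_MASS = {
    "A": 71.037114, "C": 103.009185, "D": 115.026943, "E": 129.042593,
    "F": 147.068414, "H": 137.058912, "K": 128.094963, "M": 131.040485,
}
WATER_MASS = 18.010565
PROTON_MASS = 1.00727646688


def by_fragment_mz(sequence: str, max_charge: int = 2) -> list:
    """Return one dict of b/y m/z values per cleavage site, N-term first."""
    residues = [RESIDUE_MASS[aa] for aa in sequence]
    peptide_mass = sum(residues) + WATER_MASS
    rows = []
    prefix_mass = 0.0
    for residue in residues[:-1]:  # nAA - 1 cleavage sites
        prefix_mass += residue
        row = {}
        for z in range(1, max_charge + 1):
            # b ion: N-terminal prefix; y ion: the complementary suffix plus water.
            row[f"b_z{z}"] = prefix_mass / z + PROTON_MASS
            row[f"y_z{z}"] = (peptide_mass - prefix_mass) / z + PROTON_MASS
        rows.append(row)
    return rows


frag_rows = by_fragment_mz("ACDEFHMK")
for row in frag_rows:
    print({name: round(mz, 4) for name, mz in row.items()})
```

The first row pairs the smallest b-ion (b1) with the largest y-ion (y7), which is the layout seen in rows 0 to 6 of the output above.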

#### Performance for fragment mz calculation

For a large number of precursors, use `create_fragment_mz_dataframe_by_sort_precursor()`. It is similar to `create_fragment_mz_dataframe()` but changes the row order of the input `precursor_df`.
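
One plausible reason sorting helps, stated here as an assumption rather than a description of alphabase's internals, is that peptides of equal length end up in contiguous blocks that can each be processed in one batch. The sketch below uses hypothetical sequences and plain Python to show the reordering and the resulting blocks.

```python
from itertools import groupby

# Hypothetical precursor list in arbitrary order.
precursors = [
    {"sequence": "ACDEFHMK"},
    {"sequence": "PEPTIDE"},
    {"sequence": "AGHCEWQMK"},
    {"sequence": "LLDEFHMK"},
]
for precursor in precursors:
    precursor["nAA"] = len(precursor["sequence"])

# Sorting by nAA changes the row order (the sort is stable within a length).
sorted_precursors = sorted(precursors, key=lambda p: p["nAA"])

# After sorting, each peptide length forms one contiguous block.
blocks = {
    n_aa: [p["sequence"] for p in group]
    for n_aa, group in groupby(sorted_precursors, key=lambda p: p["nAA"])
}
print(blocks)
```

Any code that relies on the original row order of the precursor dataframe should therefore keep its own key column, or re-sort afterwards.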

## alphabase.io.hdf

This module provides functionality for working with HDF files. We use it to create spectrum libraries.

## alphabase.spectrum_library

Spectrum library functionalities.

### alphabase.spectrum_library.library_base

The `SpecLibBase` class is the base structure of the spectrum library in alphabase. Its main functions save (`SpecLibBase.save_hdf()`) and load (`SpecLibBase.load_hdf()`) a library to and from an HDF file. It also generates `mod_seq_hash` and `mod_seq_charge_hash` for the `precursor_df`.

In [None]:
import os
import tempfile

import numpy as np
import alphabase.spectrum_library.library_base as lib_base

# Write the example library into the system temp directory.
TEMPDIR = tempfile.gettempdir()
hdf_path = os.path.join(TEMPDIR, "alphabase_speclib.hdf")

hdflib = lib_base.SpecLibBase(['b_z1','y_z1'])

df = pd.DataFrame()
df['sequence'] = ['ACDEFHMK']*3
df['mods'] = ['',
    'Acetyl@Protein N-term', 
    'Carbamidomethyl@C;Oxidation@M'
]
df['mod_sites'] = ['','0','2;7']
df['charge'] = [1,1,2]

frag_df = fragment.create_fragment_mz_dataframe(df, ['b_z1','y_z1'])

hdflib.precursor_df = df
hdflib.fragment_mz_df = frag_df
hdflib.fragment_intensity_df = pd.DataFrame()
hdflib.save_hdf(hdf_path)

# test load_hdf
_hdflib = lib_base.SpecLibBase(['b_z1','y_z1'])
_hdflib.load_hdf(hdf_path, load_mod_seq=True)
assert all(_hdflib.precursor_df.sequence.values==df.sequence.values)
assert all(_hdflib.precursor_df.charge.values==df.charge.values)
assert np.allclose(_hdflib.fragment_mz_df.values, frag_df)

Because `precursor_df` contains the hash columns, it is split into two dataframes in the HDF file.

One is still `precursor_df`, but with all str columns removed:

In [None]:
_hdflib.load_hdf(hdf_path)
_hdflib.precursor_df

Unnamed: 0,charge,frag_end_idx,frag_start_idx,mod_seq_charge_hash,mod_seq_hash,nAA
0,1,7,0,-3613625140520963924,-3613625140520963925,8
1,1,14,7,-4469340833947449,-4469340833947450,8
2,2,21,14,4151334950393863929,4151334950393863927,8


The other is `mod_seq_df`, which holds all the str columns together with the hash columns:

In [None]:
mod_seq_df = _hdflib.load_df_from_hdf(hdf_path, 'mod_seq_df')
mod_seq_df

Unnamed: 0,mod_seq_charge_hash,mod_seq_hash,mod_sites,mods,sequence
0,-3613625140520963924,-3613625140520963925,,,ACDEFHMK
1,-4469340833947449,-4469340833947450,0,Acetyl@Protein N-term,ACDEFHMK
2,4151334950393863929,4151334950393863927,2;7,Carbamidomethyl@C;Oxidation@M,ACDEFHMK


We can then easily merge `precursor_df` and `mod_seq_df` on `mod_seq_charge_hash` to recreate the original precursor dataframe:

In [None]:
del mod_seq_df['mod_seq_hash']
_hdflib.precursor_df = _hdflib.precursor_df.merge(mod_seq_df, on='mod_seq_charge_hash')
_hdflib.precursor_df

Unnamed: 0,charge,frag_end_idx,frag_start_idx,mod_seq_charge_hash,mod_seq_hash,nAA,mod_sites,mods,sequence
0,1,7,0,-3613625140520963924,-3613625140520963925,8,,,ACDEFHMK
1,1,14,7,-4469340833947449,-4469340833947450,8,0,Acetyl@Protein N-term,ACDEFHMK
2,2,21,14,4151334950393863929,4151334950393863927,8,2;7,Carbamidomethyl@C;Oxidation@M,ACDEFHMK


... and it is exactly the same as the original one.

In [None]:
assert all(_hdflib.precursor_df.sequence.values==df.sequence.values)
assert all(_hdflib.precursor_df.charge.values==df.charge.values)
assert np.allclose(_hdflib.fragment_mz_df.values, frag_df)
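
The split-and-merge round trip can also be shown without HDF or pandas. In the sketch below, lists of dicts stand in for the dataframes, and `split_by_type()` and `merge_on()` are hypothetical helpers rather than alphabase functions. The hash values are copied from the tables above.

```python
KEY = "mod_seq_charge_hash"

original = [
    {"sequence": "ACDEFHMK", "mods": "", "mod_sites": "",
     "charge": 1, "nAA": 8, KEY: -3613625140520963924},
    {"sequence": "ACDEFHMK", "mods": "Acetyl@Protein N-term", "mod_sites": "0",
     "charge": 1, "nAA": 8, KEY: -4469340833947449},
    {"sequence": "ACDEFHMK", "mods": "Carbamidomethyl@C;Oxidation@M",
     "mod_sites": "2;7", "charge": 2, "nAA": 8, KEY: 4151334950393863929},
]


def split_by_type(rows, key):
    """Split rows into a numeric table and a str table that shares `key`."""
    numeric_rows, str_rows = [], []
    for row in rows:
        numeric_rows.append(
            {k: v for k, v in row.items() if not isinstance(v, str)}
        )
        str_rows.append(
            {k: v for k, v in row.items() if isinstance(v, str) or k == key}
        )
    return numeric_rows, str_rows


def merge_on(left_rows, right_rows, key):
    """Inner-join two row lists on `key`, keeping the order of `left_rows`."""
    lookup = {row[key]: row for row in right_rows}
    return [
        {**row, **lookup[row[key]]} for row in left_rows if row[key] in lookup
    ]


numeric_part, str_part = split_by_type(original, KEY)
restored = merge_on(numeric_part, str_part, KEY)
print(restored == original)
```

The join only restores the original rows when the key is unique per row, which is why the precursor-level hash is used as the merge key rather than `mod_seq_hash`.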

## alphabase.io.psm_reader


The last part of alphabase, to be continued ...