# Tutorial: Spectral Libraries

This notebook introduces the spectral library functionalities for developers.

## The Base Library Class

`alphabase.spectral_library.base.SpecLibBase` is the base class for spectral libraries. See https://alphabase.readthedocs.io/en/latest/ for details. We recommend that users access spectral library functionalities via `alphabase.protein.fasta.SpecLibFasta`.

## `SpecLibFasta`

Almost all DataFrame functionalities to process proteins and peptides have been integrated into `alphabase.protein.fasta.SpecLibFasta`. 

In [1]:
from alphabase.protein.fasta import SpecLibFasta

fasta_lib = SpecLibFasta(
    charged_frag_types=['b_z1','y_z1'],
    protease='trypsin',
    fix_mods=['Carbamidomethyl@C'],
    var_mods=['Acetyl@Protein_N-term','Oxidation@M'],
    decoy=None,
)


### Start from fasta/proteins

`SpecLibFasta` performs the following steps for us:

- Load fasta files into a protein_dict
- Digest proteins into peptide sequences
- Append decoy peptide sequences if `self.decoy` is not `None`
- Add fixed and variable modifications
- Add special modifications (optional)
- Add peptide labeling (optional)
- Add charge states to peptides

#### Load fasta files into a protein_dict

In [2]:
# from alphabase.protein.fasta import load_all_proteins
# protein_dict = load_all_proteins(fasta_files)

# For example, the protein_dict is:
protein_dict = {
    'yy': {
        'protein_id': 'yy',
        'full_name': 'yy_yy',
        'gene_name': 'y_y',
        'sequence': 'FGHIKLMNPQR'
    },
    'xx': {
        'protein_id': 'xx',
        'full_name': 'xx_xx',
        'gene_name': 'x_x',
        'sequence': 'MACDESTYKXKFGHIKLMNPQRST'
    },
}

#### Digest proteins into peptide sequences

In [3]:
fasta_lib.get_peptides_from_protein_dict(protein_dict)
fasta_lib.precursor_df

Unnamed: 0,sequence,protein_idxes,miss_cleavage,is_prot_nterm,is_prot_cterm,mods,mod_sites,nAA
0,XKFGHIK,1,1,False,False,,,7
1,LMNPQRST,1,1,False,True,,,8
2,ACDESTYK,1,0,True,False,,,8
3,MACDESTYK,1,0,True,False,,,9
4,ACDESTYKXK,1,1,True,False,,,10
5,FGHIKLMNPQR,0;1,1,True,True,,,11
6,MACDESTYKXK,1,1,True,False,,,11
7,XKFGHIKLMNPQR,1,2,False,False,,,13
8,FGHIKLMNPQRST,1,2,False,True,,,13
9,ACDESTYKXKFGHIK,1,2,True,False,,,15
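
To make the digestion step more concrete, here is a minimal pure-Python sketch of tryptic digestion with missed cleavages. It is not the AlphaBase implementation: the function name `digest_trypsin`, the simple "cleave after K or R" rule, and the defaults (`max_missed_cleavages=2`, `min_len=7`, `max_len=35`) are assumptions chosen so that the result is consistent with the table above. The sketch also leaves out N-terminal methionine removal, which is why peptides such as `ACDESTYK` do not appear in its output.

```python
def digest_trypsin(sequence, max_missed_cleavages=2, min_len=7, max_len=35):
    """Cleave after every K or R and join up to `max_missed_cleavages` pieces.

    Hypothetical helper for illustration only; defaults are assumptions.
    Returns a list of (peptide, n_missed_cleavages) tuples.
    """
    # Boundaries where a peptide may start or end:
    # the protein start, the position after each internal K/R, and the protein end.
    cut_sites = [0]
    cut_sites += [i + 1 for i, aa in enumerate(sequence[:-1]) if aa in "KR"]
    cut_sites.append(len(sequence))

    peptides = []
    for start_idx in range(len(cut_sites) - 1):
        for n_missed in range(max_missed_cleavages + 1):
            stop_idx = start_idx + n_missed + 1
            if stop_idx >= len(cut_sites):
                break
            peptide = sequence[cut_sites[start_idx]:cut_sites[stop_idx]]
            # Keep only peptides within the allowed length window.
            if min_len <= len(peptide) <= max_len:
                peptides.append((peptide, n_missed))
    return peptides


for peptide, n_missed in digest_trypsin("MACDESTYKXKFGHIKLMNPQRST"):
    print(peptide, n_missed)
```

For the protein `yy` (`FGHIKLMNPQR`), the sketch yields a single peptide with one missed cleavage, which agrees with the `miss_cleavage` value of row 5 above.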


In [4]:
fasta_lib.protein_df

Unnamed: 0,protein_id,full_name,gene_name,sequence
0,yy,yy_yy,y_y,FGHIKLMNPQR
1,xx,xx_xx,x_x,MACDESTYKXKFGHIKLMNPQRST


#### Append decoy sequences

This step depends on `self.decoy` (a `str` or `None`), whose value can be one of:

- `protein_reverse`: Reverse on target protein sequences
- `pseudo_reverse`: Pseudo-reverse on target peptide sequences
- `diann`: DiaNN-like decoy
- `None`: no decoy
            
Let's take `diann` as an example:

In [5]:
fasta_lib.decoy = 'diann'
fasta_lib.append_decoy_sequence()
fasta_lib.precursor_df.sample(5, random_state=0)

Unnamed: 0,sequence,protein_idxes,miss_cleavage,is_prot_nterm,is_prot_cterm,mods,mod_sites,nAA,decoy
20,MACDESTYKXKFGHIK,1,2,True,False,,,16,0
10,FGHIKLMNPQR,0;1,1,True,True,,,11,0
14,FLHIKLMNPQRTT,1,2,False,True,,,13,1
13,FLHIKLMNPNR,0;1,1,True,True,,,11,1
1,XLFGHVK,1,1,False,False,,,7,1
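
The three decoy strategies can be sketched in a few lines of plain Python. This is an illustration, not the library code: the function names are hypothetical, and the substitution table is the DIA-NN-style mapping assumed here because it reproduces the decoy sequences in the table above (for example `FGHIKLMNPQR` becoming `FLHIKLMNPNR`). Keeping the C-terminal residue fixed in `pseudo_reverse` is the common convention and is likewise an assumption.

```python
# Assumed DIA-NN-style substitution table: each residue in _ORIGINAL
# is replaced by the residue at the same position in _MUTATED.
_ORIGINAL = "GAVLIFMPWSCTYHKRQEND"
_MUTATED = "LLLVVLLLLTSSSSLLNDQE"
_MUTATION = dict(zip(_ORIGINAL, _MUTATED))


def diann_like_decoy(peptide):
    """Mutate the second and the second-to-last residue; keep the rest."""
    if len(peptide) < 3:
        return peptide
    residues = list(peptide)
    # A set avoids mutating the same position twice for 3-residue peptides.
    for idx in {1, len(peptide) - 2}:
        # Residues missing from the table (e.g. 'X') are left unchanged.
        residues[idx] = _MUTATION.get(peptide[idx], peptide[idx])
    return "".join(residues)


def pseudo_reverse(peptide):
    """Reverse all residues except the C-terminal one."""
    return peptide[:-1][::-1] + peptide[-1:]


def protein_reverse(protein_sequence):
    """Reverse the whole protein sequence before digestion."""
    return protein_sequence[::-1]


print(diann_like_decoy("FGHIKLMNPQR"))  # FLHIKLMNPNR
print(pseudo_reverse("ACDESTYK"))       # YTSEDCAK
```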


#### Add modifications

`add_modifications()` will add fixed and variable modifications. 

In [6]:
fasta_lib.add_modifications()
fasta_lib.precursor_df.sample(5, random_state=0)

Unnamed: 0,sequence,protein_idxes,miss_cleavage,is_prot_nterm,is_prot_cterm,mods,mod_sites,nAA,decoy
35,FLHIKLMNPNR,0;1,1,True,True,Acetyl@Protein_N-term;Oxidation@M,0;7,11,1
34,FLHIKLMNPNR,0;1,1,True,True,,,11,1
41,FGHIKLMNPQRST,1,2,False,True,Oxidation@M,7,13,0
27,MACDESTYKXK,1,1,True,False,Acetyl@Protein_N-term;Oxidation@M;Carbamidomet...,0;1;3,11,0
11,MACDESTYK,1,0,True,False,Acetyl@Protein_N-term;Oxidation@M;Carbamidomet...,0;1;3,9,0
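
The difference between fixed and variable modifications can be illustrated with a small enumeration sketch. It is a simplified stand-in for `add_modifications()`, not its actual implementation: `enumerate_mod_forms` and the default `max_var_mods=2` are assumptions, and the sketch does not check whether two variable modifications compete for the same site. Sites follow the convention visible in the `mod_sites` column above, where 0 is the N-terminus and 1..nAA are the residues.

```python
from itertools import combinations


def enumerate_mod_forms(sequence, is_prot_nterm, fix_mods, var_mods,
                        max_var_mods=2):
    """Return one sorted list of (site, mod_name) pairs per modified form.

    Hypothetical helper for illustration only.
    """
    def candidate_sites(mod):
        target = mod.split("@")[1]
        if target == "Protein_N-term":
            return [0] if is_prot_nterm else []
        return [i + 1 for i, aa in enumerate(sequence) if aa == target]

    # Fixed modifications are applied to every matching site in every form.
    fixed = [(site, mod) for mod in fix_mods for site in candidate_sites(mod)]
    # Variable modifications may be present or absent on each candidate site.
    variable = [(site, mod) for mod in var_mods for site in candidate_sites(mod)]

    forms = []
    for n_var in range(min(max_var_mods, len(variable)) + 1):
        for chosen in combinations(variable, n_var):
            forms.append(sorted(fixed + list(chosen)))
    return forms


forms = enumerate_mod_forms(
    "MACDESTYK", True,
    fix_mods=["Carbamidomethyl@C"],
    var_mods=["Acetyl@Protein_N-term", "Oxidation@M"],
)
for form in forms:
    print(form)
```

With these inputs the sketch produces four forms for `MACDESTYK`, all carrying `Carbamidomethyl@C` on site 3.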


#### Add special modifications

Special modifications are PTMs over which we want finer control:

1. We may want to keep only the modified forms and exclude the unmodified peptides.
2. `GlyGly@K` cannot occur on the peptide C-term because trypsin cannot cleave after a Lys carrying `GlyGly`.
3. For some special modifications, such as `Phospho@S` and `HexNAc@S`, we would like to limit the number of peptidoforms to control memory usage.

In [7]:
fasta_lib.special_mods = ['GlyGly@K']
fasta_lib.special_mods_cannot_modify_pep_c_term = True
fasta_lib.min_special_mod_num = 1 # exclude the unmodified forms
fasta_lib.max_special_mod_num = 1 # limit the number of special modifications per peptide
fasta_lib.add_special_modifications()
fasta_lib.precursor_df.sample(10, random_state=0)

Unnamed: 0,sequence,protein_idxes,miss_cleavage,is_prot_nterm,is_prot_cterm,mods,mod_sites,nAA,decoy
45,MACDESTYKXKFGHIK,1,2,True,False,Acetyl@Protein_N-term;Carbamidomethyl@C;GlyGly@K,0;3;9,16,0
33,ASDESTYKXKFGHVK,1,2,True,False,Acetyl@Protein_N-term;GlyGly@K,0;8,15,1
40,MACDESTYKXKFGHIK,1,2,True,False,Oxidation@M;Carbamidomethyl@C;GlyGly@K,1;3;11,16,0
26,FGHIKLMNPQRST,1,2,False,True,GlyGly@K,5,13,0
11,MACDESTYKXK,1,1,True,False,Acetyl@Protein_N-term;Oxidation@M;Carbamidomet...,0;1;3;9,11,0
2,ACDESTYKXK,1,1,True,False,Acetyl@Protein_N-term;Carbamidomethyl@C;GlyGly@K,0;2;8,10,0
32,ASDESTYKXKFGHVK,1,2,True,False,GlyGly@K,10,15,1
43,MACDESTYKXKFGHIK,1,2,True,False,Acetyl@Protein_N-term;Oxidation@M;Carbamidomet...,0;1;3;9,16,0
46,MACDESTYKXKFGHIK,1,2,True,False,Acetyl@Protein_N-term;Carbamidomethyl@C;GlyGly@K,0;3;11,16,0
30,XKFGHIKLMNPQR,1,2,False,False,GlyGly@K,7,13,0
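
The effect of the three settings used above can be sketched as a site enumeration. The helper `special_mod_site_sets` is hypothetical and only mimics the behaviour suggested by the output: it collects candidate sites, optionally drops a site on the peptide C-terminal residue, and returns every site combination whose size lies between the minimum and maximum number of special modifications.

```python
from itertools import combinations


def special_mod_site_sets(sequence, target_aa="K", min_num=1, max_num=1,
                          cannot_modify_pep_c_term=True):
    """Return tuples of 1-based sites that may carry the special modification.

    Hypothetical helper for illustration only.
    """
    sites = [i + 1 for i, aa in enumerate(sequence) if aa == target_aa]
    if cannot_modify_pep_c_term:
        # Drop a candidate site that sits on the last residue of the peptide.
        sites = [site for site in sites if site != len(sequence)]

    site_sets = []
    # min_num=1 excludes the unmodified form; max_num caps the combinations.
    for n_mods in range(min_num, min(max_num, len(sites)) + 1):
        site_sets.extend(combinations(sites, n_mods))
    return site_sets


print(special_mod_site_sets("MACDESTYKXKFGHIK"))  # [(9,), (11,)]
print(special_mod_site_sets("FGHIKLMNPQRST"))     # [(5,)]
```

The sites 9 and 11 for `MACDESTYKXKFGHIK` match the `GlyGly@K` positions in the table above, and the C-terminal Lys at site 16 is never selected.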


#### Add peptide labeling

For example, dimethyl labeling with a light channel (0) and a heavy channel (4):

In [8]:
fasta_lib.labeling_channels = {
    0: ['Dimethyl@K', 'Dimethyl@Any_N-term'],
    4: ['Dimethyl:2H(4)@K', 'Dimethyl:2H(4)@Any_N-term'],
}
fasta_lib.add_peptide_labeling()
fasta_lib.precursor_df.sample(10, random_state=0)

Unnamed: 0,sequence,protein_idxes,miss_cleavage,is_prot_nterm,is_prot_cterm,mods,mod_sites,nAA,decoy,labeling_channel
85,XKFGHIKLMNPQR,1,2,False,False,GlyGly@K;Dimethyl:2H(4)@Any_N-term;Dimethyl:2H...,7;0;2;7,13,0,4
10,MACDESTYKXK,1,1,True,False,Carbamidomethyl@C;GlyGly@K;Dimethyl@Any_N-term...,3;9;0;9;11,11,0,0
75,FLHIKLMNPNR,0;1,1,True,True,Acetyl@Protein_N-term;GlyGly@K;Dimethyl:2H(4)@K,0;5;5,11,1,4
2,ACDESTYKXK,1,1,True,False,Acetyl@Protein_N-term;Carbamidomethyl@C;GlyGly...,0;2;8;8;10,10,0,0
24,XLFGHIKLMNPNR,1,2,False,False,GlyGly@K;Dimethyl@Any_N-term;Dimethyl@K,7;0;7,13,1,0
101,MACDESTYKXKFGHIK,1,2,True,False,Acetyl@Protein_N-term;Carbamidomethyl@C;GlyGly...,0;3;11;9;11;16,16,0,4
109,MLCDESTYKXKFGHVK,1,2,True,False,Acetyl@Protein_N-term;Carbamidomethyl@C;GlyGly...,0;3;11;9;11;16,16,1,4
7,FGHIKLMNPQR,0;1,1,True,True,Acetyl@Protein_N-term;Oxidation@M;GlyGly@K;Dim...,0;7;5;5,11,0,0
16,MLCDESTYKVK,1,1,True,False,Acetyl@Protein_N-term;Carbamidomethyl@C;GlyGly...,0;3;9;9;11,11,1,0
91,ACDESTYKXKFGHIK,1,2,True,False,Carbamidomethyl@C;GlyGly@K;Dimethyl:2H(4)@Any_...,2;10;0;8;10;15,15,0,4


#### Add charge states

In [9]:
fasta_lib.add_charge()
fasta_lib.precursor_df.sample(5, random_state=0)

Unnamed: 0,sequence,protein_idxes,miss_cleavage,is_prot_nterm,is_prot_cterm,mods,mod_sites,nAA,decoy,labeling_channel,charge
122,MACDESTYKXKFGHIK,1,2,True,False,Oxidation@M;Carbamidomethyl@C;GlyGly@K;Dimethy...,1;3;11;0;9;11;16,16,0,0,4
66,FLHIKLMNPQRTT,1,2,False,True,GlyGly@K;Dimethyl@Any_N-term;Dimethyl@K,5;0;5,13,1,0,2
142,MLCDESTYKXKFGHVK,1,2,True,False,Oxidation@M;Carbamidomethyl@C;GlyGly@K;Dimethy...,1;3;9;0;9;11;16,16,1,0,3
246,XKFGHIKLMNPQR,1,2,False,False,Oxidation@M;GlyGly@K;Dimethyl:2H(4)@Any_N-term...,9;2;0;2;7,13,0,4,2
146,MLCDESTYKXKFGHVK,1,2,True,False,Oxidation@M;Carbamidomethyl@C;GlyGly@K;Dimethy...,1;3;11;0;9;11;16,16,1,0,4


#### `import_and_process_protein_dict()` combines all steps

Use `import_and_process_fasta()` instead when starting from fasta files.

In [10]:
fasta_lib.special_mods = []
fasta_lib.labeling_channels = None
fasta_lib.import_and_process_protein_dict(protein_dict)
fasta_lib.protein_df

Unnamed: 0,protein_id,full_name,gene_name,sequence
0,yy,yy_yy,y_y,FGHIKLMNPQR
1,xx,xx_xx,x_x,MACDESTYKXKFGHIKLMNPQRST


In [11]:
fasta_lib.precursor_df

Unnamed: 0,sequence,protein_idxes,miss_cleavage,is_prot_nterm,is_prot_cterm,mods,mod_sites,nAA,decoy,charge,precursor_mz
0,LMNPQRST,1,1,False,True,Oxidation@M,2,8,0,2,481.739834
1,LMNPQRST,1,1,False,True,,,8,0,2,473.742377
2,ACDESTYK,1,0,True,False,Carbamidomethyl@C,2,8,0,2,487.200207
3,ACDESTYK,1,0,True,False,Acetyl@Protein_N-term;Carbamidomethyl@C,0;2,8,0,2,508.20549
4,LLNPQRTT,1,1,False,True,,,8,1,2,471.771991
5,ASDESTSK,1,0,True,False,,,8,1,2,412.685247
6,ASDESTSK,1,0,True,False,Acetyl@Protein_N-term,0,8,1,2,433.690529
7,MACDESTYK,1,0,True,False,Oxidation@M;Carbamidomethyl@C,1;3,9,0,2,560.717907
8,MACDESTYK,1,0,True,False,Carbamidomethyl@C,3,9,0,2,552.72045
9,MACDESTYK,1,0,True,False,Acetyl@Protein_N-term;Oxidation@M;Carbamidomet...,0;1;3,9,0,2,581.72319


### Start from peptides instead of proteins

The modular design of `SpecLibFasta` allows us to start from arbitrary types of peptide input, so fasta files or a `protein_dict` are not required.

For example, suppose we have a list of sequences and we want to add modifications using `SpecLibFasta` functionalities:

In [12]:
import pandas as pd
pep_lib = SpecLibFasta(
    charged_frag_types=['b_z1','y_z1'],
    fix_mods=['Carbamidomethyl@C'],
    var_mods=['Acetyl@Protein_N-term','Oxidation@M'],
    labeling_channels={
        0: ['Dimethyl@K', 'Dimethyl@Any_N-term'],
        4: ['Dimethyl:2H(4)@K', 'Dimethyl:2H(4)@Any_N-term'],
    },
    decoy=None,
)

pep_lib.precursor_df = pd.DataFrame({
    'sequence': ['ABCDEFG','HIJKLMN','OPQRST','UVWXYZ']
})
pep_lib.process_from_naked_peptide_seqs()
pep_lib.precursor_df

Unnamed: 0,sequence,nAA,is_prot_nterm,is_prot_cterm,mods,mod_sites,labeling_channel,charge,precursor_mz
0,OPQRST,6,False,False,Dimethyl@Any_N-term,0,0,2,427.248152
1,HIJKLMN,7,False,False,Oxidation@M;Dimethyl@Any_N-term;Dimethyl@K,6;0;4,0,2,470.786056
2,HIJKLMN,7,False,False,Dimethyl@Any_N-term;Dimethyl@K,0;4,0,2,462.788599
3,OPQRST,6,False,False,Dimethyl:2H(4)@Any_N-term,0,4,2,429.260705
4,HIJKLMN,7,False,False,Oxidation@M;Dimethyl:2H(4)@Any_N-term;Dimethyl...,6;0;4,4,2,474.811163
5,HIJKLMN,7,False,False,Dimethyl:2H(4)@Any_N-term;Dimethyl:2H(4)@K,0;4,4,2,466.813706
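
The `precursor_mz` differences between labeling channels 0 and 4 in the table above can be checked by hand. The sketch below assumes the monoisotopic mass shifts of 28.031300 Da for `Dimethyl` and 32.056407 Da for `Dimethyl:2H(4)`, and assumes one label on the peptide N-terminus plus one per lysine; the function names are hypothetical.

```python
# Assumed monoisotopic mass shifts of the two labels (in Da).
DIMETHYL_MASS = 28.031300        # Dimethyl (light channel)
DIMETHYL_2H4_MASS = 32.056407    # Dimethyl:2H(4) (heavy channel)


def count_label_sites(sequence):
    """One label on the peptide N-terminus plus one on every lysine."""
    return 1 + sequence.count("K")


def heavy_light_mz_shift(sequence, charge):
    """Expected m/z difference between the heavy and the light channel."""
    mass_shift = count_label_sites(sequence) * (DIMETHYL_2H4_MASS - DIMETHYL_MASS)
    return mass_shift / charge


# Compare with the differences read from the table above (charge 2).
print(heavy_light_mz_shift("OPQRST", 2), 429.260705 - 427.248152)
print(heavy_light_mz_shift("HIJKLMN", 2), 466.813706 - 462.788599)
```

`OPQRST` carries a single label and `HIJKLMN` carries two, so the second shift is twice the first.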


### Calculate masses

#### Calculate precursor m/z

In [13]:
fasta_lib.calc_precursor_mz()
# fasta_lib.calc_precursor_isotope()
fasta_lib.precursor_df

Unnamed: 0,sequence,protein_idxes,miss_cleavage,is_prot_nterm,is_prot_cterm,mods,mod_sites,nAA,decoy,charge,precursor_mz
0,LMNPQRST,1,1,False,True,Oxidation@M,2,8,0,2,481.739834
1,LMNPQRST,1,1,False,True,,,8,0,2,473.742377
2,ACDESTYK,1,0,True,False,Carbamidomethyl@C,2,8,0,2,487.200207
3,ACDESTYK,1,0,True,False,Acetyl@Protein_N-term;Carbamidomethyl@C,0;2,8,0,2,508.20549
4,LLNPQRTT,1,1,False,True,,,8,1,2,471.771991
5,ASDESTSK,1,0,True,False,,,8,1,2,412.685247
6,ASDESTSK,1,0,True,False,Acetyl@Protein_N-term,0,8,1,2,433.690529
7,MACDESTYK,1,0,True,False,Oxidation@M;Carbamidomethyl@C,1;3,9,0,2,560.717907
8,MACDESTYK,1,0,True,False,Carbamidomethyl@C,3,9,0,2,552.72045
9,MACDESTYK,1,0,True,False,Acetyl@Protein_N-term;Oxidation@M;Carbamidomet...,0;1;3,9,0,2,581.72319


After `calc_precursor_mz()`, all sequences containing `x` are removed: the mass assigned to `x` is so large that the resulting precursor m/z falls outside the range from `fasta_lib.min_precursor_mz` to `fasta_lib.max_precursor_mz`.

In [14]:
from alphabase.constants.aa import AA_ASCII_MASS
(
    fasta_lib.min_precursor_mz, fasta_lib.max_precursor_mz, 
    f"mass of 'x' is {AA_ASCII_MASS[ord('x')]}"
)

(400.0, 2000.0, "mass of 'x' is 100000000.0")
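
The filtering can be reproduced with a short stand-alone calculation. The sketch below is not the AlphaBase code: it uses standard monoisotopic residue masses, treats any residue missing from the table (such as `X`) as having the placeholder mass printed above, and the function names are hypothetical. The precursor m/z is the neutral peptide mass divided by the charge, plus one proton mass.

```python
PROTON_MASS = 1.007276467
WATER_MASS = 18.010565
# Standard monoisotopic residue masses (in Da).
RESIDUE_MASS = {
    "G": 57.021464, "A": 71.037114, "S": 87.032028, "P": 97.052764,
    "V": 99.068414, "T": 101.047679, "C": 103.009185, "L": 113.084064,
    "I": 113.084064, "N": 114.042927, "D": 115.026943, "Q": 128.058578,
    "K": 128.094963, "E": 129.042593, "M": 131.040485, "H": 137.058912,
    "F": 147.068414, "R": 156.101111, "Y": 163.063329, "W": 186.079313,
}
# Assumption: unknown residues get the same placeholder mass as shown above.
UNKNOWN_RESIDUE_MASS = 1e8


def precursor_mz(sequence, charge, mod_mass=0.0):
    """m/z = (residues + water + modifications) / charge + proton mass."""
    neutral_mass = sum(RESIDUE_MASS.get(aa, UNKNOWN_RESIDUE_MASS) for aa in sequence)
    neutral_mass += WATER_MASS + mod_mass
    return neutral_mass / charge + PROTON_MASS


def in_mz_range(mz, min_mz=400.0, max_mz=2000.0):
    """Mimic the min/max precursor m/z filter with the limits shown above."""
    return min_mz <= mz <= max_mz


print(precursor_mz("LMNPQRST", 2))                # close to 473.742377
print(precursor_mz("LMNPQRST", 2, 15.994915))     # with Oxidation@M
print(in_mz_range(precursor_mz("XKFGHIK", 2)))    # False
```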

#### Calculate fragment m/z

In [15]:
fasta_lib.calc_fragment_mz_df()
fasta_lib.fragment_mz_df

Unnamed: 0,b_z1,y_z1
0,114.091339,849.388306
1,261.126740,702.352905
2,375.169678,588.309998
3,472.222443,491.257233
4,600.281006,363.198669
...,...,...
486,941.502563,588.309998
487,1038.555298,491.257233
488,1166.613892,363.198669
489,1322.714966,207.097549
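
The first rows of this table belong to `LMNPQRST` with `Oxidation@M` on site 2, and they can be recomputed with a small sketch of singly charged b and y ions. The helper `b_y_mz` is hypothetical, uses standard monoisotopic masses, and handles residue modifications only (an N-terminal modification on site 0 is not covered).

```python
PROTON_MASS = 1.007276467
WATER_MASS = 18.010565
# Standard monoisotopic residue masses (in Da).
RESIDUE_MASS = {
    "G": 57.021464, "A": 71.037114, "S": 87.032028, "P": 97.052764,
    "V": 99.068414, "T": 101.047679, "C": 103.009185, "L": 113.084064,
    "I": 113.084064, "N": 114.042927, "D": 115.026943, "Q": 128.058578,
    "K": 128.094963, "E": 129.042593, "M": 131.040485, "H": 137.058912,
    "F": 147.068414, "R": 156.101111, "Y": 163.063329, "W": 186.079313,
}


def b_y_mz(sequence, mod_mass_by_site=None):
    """Return (b_ions, y_ions) m/z lists at charge 1, one entry per cleavage.

    `mod_mass_by_site` maps 1-based residue sites to modification masses.
    """
    mod_mass_by_site = mod_mass_by_site or {}
    masses = [
        RESIDUE_MASS[aa] + mod_mass_by_site.get(i + 1, 0.0)
        for i, aa in enumerate(sequence)
    ]
    total = sum(masses)

    b_ions, y_ions = [], []
    prefix = 0.0
    for mass in masses[:-1]:
        prefix += mass
        # b ion: N-terminal prefix plus one proton.
        b_ions.append(prefix + PROTON_MASS)
        # Complementary y ion: C-terminal suffix plus water plus one proton.
        y_ions.append(total - prefix + WATER_MASS + PROTON_MASS)
    return b_ions, y_ions


b_ions, y_ions = b_y_mz("LMNPQRST", {2: 15.994915})  # Oxidation@M on site 2
for b, y in zip(b_ions, y_ions):
    print(round(b, 6), round(y, 6))
```

Each peptide of length nAA contributes nAA - 1 rows, one per cleavage position between adjacent residues.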


Use `frag_start_idx` and `frag_stop_idx` in `precursor_df` to locate the fragments that belong to a given precursor.

In [16]:
ith_pep = 5
frag_start, frag_stop = fasta_lib.precursor_df.loc[ith_pep,['frag_start_idx','frag_stop_idx']].values
fasta_lib.fragment_mz_df.iloc[frag_start:frag_stop]

Unnamed: 0,b_z1,y_z1
35,72.044388,753.326111
36,159.076416,666.294067
37,274.103363,551.267151
38,403.145966,422.224548
39,490.177979,335.192505
40,591.225647,234.144836
41,678.25769,147.112808
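
The row range 35 to 41 for the sixth precursor is consistent with a flat layout in which fragments are stored contiguously in precursor order, with nAA - 1 rows per precursor. The sketch below illustrates that bookkeeping under this assumption; `fragment_index_ranges` is a hypothetical helper, not an AlphaBase function.

```python
from itertools import accumulate


def fragment_index_ranges(n_aa_list):
    """Assign each precursor a [start, stop) row range with nAA - 1 rows."""
    stops = list(accumulate(n_aa - 1 for n_aa in n_aa_list))
    starts = [0] + stops[:-1]
    return list(zip(starts, stops))


# nAA values of the first ten precursors shown in precursor_df above.
n_aa_list = [8, 8, 8, 8, 8, 8, 8, 9, 9, 9]
ranges = fragment_index_ranges(n_aa_list)
print(ranges[5])  # (35, 42)
```

Slicing with `iloc[35:42]` then returns rows 35 to 41, the seven fragment rows of `ASDESTSK`.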
