# Using peptdeep for MHC class I immunopeptidomics

This notebook introduces how to generate spectral libraries for immunopeptidomics analysis from a list of protein sequences. This entails several steps:

1. unspecific digestion of protein sequences
2. selection of peptide sequences used for library prediction by peptdeep-hla prediction
   2.1 using the pretrained model
   2.2 using an improved model by including a transfer learning step
3. spectral library prediction
4. matching the peptides back to the proteins (this can be done before or after library prediction or search)



Note that the `pydivsufsort` package is not installed with peptdeep by default. Install it with:
```
pip install "peptdeep[development,hla]"
```

Or install it from within the Jupyter notebook:

In [1]:
%pip install -q pydivsufsort

Note: you may need to restart the kernel to use updated packages.




## 1. Unspecific digestion in alphabase

The unspecific digestion workflow uses the longest common prefix (LCP) algorithm, which is based on the suffix array data structure and has been shown to be very efficient for unspecific digestion [https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-11-577]. Here we use `pydivsufsort`, a Python wrapper of the high-performance C library libdivsufsort [https://github.com/y-256/libdivsufsort], to facilitate LCP-based digestion.

This means that the digestion is performed on a single string and retrieves both the peptide sequences and the start and stop indices of each peptide within the complete sequence. Therefore, unspecific digestion in alphabase involves two steps:

1. concatenation of protein sequences into a single sequence
2. unspecific digestion



### 1.1 Concatenate protein sequences into a single sequence

The protein sequences are concatenated into a single sequence. The sequences are separated by a sentinel character, in this case '$', so that no peptides spanning two proteins are formed. Note that the first and last sentinel characters are crucial as well.


In [2]:
def concat_sequences_for_nonspecific_digestion(seq_list, sep="$"):
    return sep + sep.join(seq_list) + sep

In [3]:
prot_seq_list = ["MABCDEKFGHIJKLMNOPQRST","FGHIJKLMNOPQR"]
cat_prot = concat_sequences_for_nonspecific_digestion(prot_seq_list, sep="$")
cat_prot

'$MABCDEKFGHIJKLMNOPQRST$FGHIJKLMNOPQR$'
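
To make the role of the sentinel concrete, the following sketch compares 8-mers obtained from the toy proteins above with and without a separator. It is a plain-Python illustration only and does not use alphabase or peptdeep; the helper `windows` is a hypothetical name introduced here.

```python
# Toy proteins taken from the example above.
proteins = ["MABCDEKFGHIJKLMNOPQRST", "FGHIJKLMNOPQR"]


def windows(seq, k):
    """Return all substrings of length k (hypothetical helper for illustration)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


# 8-mers that truly occur inside a single protein.
true_8mers = set().union(*(windows(p, 8) for p in proteins))

# Joining without a separator creates windows that span two proteins.
joined_plain = "".join(proteins)
chimeric = windows(joined_plain, 8) - true_8mers

# Joining with '$' and discarding windows that contain it avoids this.
joined_sentinel = "$" + "$".join(proteins) + "$"
with_sentinel = {w for w in windows(joined_sentinel, 8) if "$" not in w}

print(sorted(chimeric))
print(with_sentinel == true_8mers)
```

Without the sentinel, windows such as `QRSTFGHI` appear even though they exist in neither protein.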

The same can be done directly from a fasta: 

In [4]:
from peptdeep.hla.hla_utils import load_prot_df
fasta_path = "D:/Software/FASTA/Human/example.fasta"
fasta = load_prot_df(fasta_path)
fasta

Unnamed: 0,protein_id,full_name,gene_name,gene_org,description,sequence,nAA
tr|A0A024R161|A0A024R161_HUMAN,A0A024R161,tr|A0A024R161|A0A024R161_HUMAN,DNAJC25-GNG10,A0A024R161_HUMAN,tr|A0A024R161|A0A024R161_HUMAN Guanine nucleot...,MGAPLLSPGWGAGAAGRRWWMLLAPLLPALLLVRPAGALVEGLYCG...,153
tr|A0A024RAP8|A0A024RAP8_HUMAN,A0A024RAP8,tr|A0A024RAP8|A0A024RAP8_HUMAN,KLRC4-KLRK1,A0A024RAP8_HUMAN,"tr|A0A024RAP8|A0A024RAP8_HUMAN HCG2009644, iso...",MGWIRGRRSRHSWEMSEFHNYNLDLKKSDFSTRWQKQRCPVVKSKC...,216


In [5]:
from peptdeep.hla.hla_utils import cat_proteins
cat_fasta = cat_proteins(fasta['sequence'])
cat_fasta

'$MGAPLLSPGWGAGAAGRRWWMLLAPLLPALLLVRPAGALVEGLYCGTRDCYEVLGVSRSAGKAEIARAYRQLARRYHPDRYRPQPGDEGPGRTPQSAEEAFLLVATAYETLKVSQAAAELQQYCMQNACKDALLVGVPAGSNPFREPRSCALL$MGWIRGRRSRHSWEMSEFHNYNLDLKKSDFSTRWQKQRCPVVKSKCRENASPFFFCCFIAVAMGIRFIIMVTIWSAVFLNSLFNQEVQIPLTESYCGPCPKNWICYKNNCYQFFDESKNWYESQASCMSQNASLLKVYSKEDQDLLKLVKSYHWMGLVHIPTNGSWQWEDGSILSPNLLTIIEMQKGDCALYASSFKGYIENCSTPNTYICMQRTV$'

### 1.2 Unspecific digestion

Use `alphabase.protein.lcp_digest.get_substring_indices` to get all non-redundant unspecific peptide sequences from the concatenated protein sequence. The digested peptides are stored in a dataframe by their start and stop indices in the concatenated protein sequence string. To save RAM, the `peptdeep.hla` module works on start and stop indices instead of on peptide sequences directly. This uses about 8 times less RAM for HLA-I peptides (lengths from 8 to 14, demonstrated below). A large protein sequence database yields millions of unspecific peptides, so working with strings is not feasible for a complete human FASTA because of the extremely large RAM requirements (~70M unspecific sequences from the reviewed Swiss-Prot FASTA already require ~4-5 GB of RAM).

Using the `get_substring_indices` function, we extract the start and stop indices of all peptide sequences between 8 and 14 aa (`min_len`, `max_len`) from the concatenated protein sequences. All peptide sequences are unique, which is guaranteed by the LCP algorithm.

In [6]:
from alphabase.protein.lcp_digest import get_substring_indices
import pandas as pd
import sys

start_idxes, stop_idxes = get_substring_indices(
    cat_fasta, min_len=8, max_len=14, stop_char="$"
)
digest_pos_df = pd.DataFrame({
    "start_pos": start_idxes,
    "stop_pos": stop_idxes,
})
digest_pos_df

Unnamed: 0,start_pos,stop_pos
0,1,9
1,1,10
2,1,11
3,1,12
4,1,13
...,...,...
2438,361,370
2439,361,371
2440,362,370
2441,362,371
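
The uniqueness guarantee can be illustrated with a small pure-Python sketch of the LCP idea: after sorting all suffixes, a prefix of a suffix is new only if it is longer than the common prefix shared with the preceding suffix. This is an assumption-laden toy version (naive suffix sorting, hypothetical function name `lcp_unique_spans`), not the `pydivsufsort`-based code used by alphabase.

```python
def lcp_unique_spans(text, min_len, max_len, stop_char="$"):
    """Return (start, stop) spans of unique substrings using a naive suffix array."""
    n = len(text)
    # Naive suffix array: fine for toy inputs, far too slow for real databases.
    suffix_array = sorted(range(n), key=lambda i: text[i:])
    spans = []
    prev = None
    for pos in suffix_array:
        # Length of the common prefix with the previous suffix in sorted order.
        lcp = 0
        if prev is not None:
            while (pos + lcp < n and prev + lcp < n
                   and text[pos + lcp] == text[prev + lcp]):
                lcp += 1
        # Peptides must end before the next sentinel character.
        next_stop = text.find(stop_char, pos)
        available = (next_stop if next_stop != -1 else n) - pos
        # Prefixes of length <= lcp were already seen at the previous suffix.
        for length in range(max(min_len, lcp + 1), min(max_len, available) + 1):
            spans.append((pos, pos + length))
        prev = pos
    return spans


toy_text = "$MABCDEKFGHIJKLMNOPQRST$FGHIJKLMNOPQR$"
spans = lcp_unique_spans(toy_text, min_len=8, max_len=14)
substrings = [toy_text[a:b] for a, b in spans]
print(len(substrings), len(set(substrings)))
```

Both printed numbers are equal, because each substring is emitted exactly once, at the first suffix (in sorted order) that starts with it.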


In [7]:
RAM_use_idxes = sys.getsizeof(digest_pos_df)*1e-6

The unspecific peptide sequences can be located by the `start_pos` and `stop_pos`.

In [8]:
digest_pos_df["sequence"] = digest_pos_df[
    ["start_pos","stop_pos"]
].apply(lambda x: cat_fasta[slice(*x)], axis=1)
digest_pos_df

Unnamed: 0,start_pos,stop_pos,sequence
0,1,9,MGAPLLSP
1,1,10,MGAPLLSPG
2,1,11,MGAPLLSPGW
3,1,12,MGAPLLSPGWG
4,1,13,MGAPLLSPGWGA
...,...,...,...
2438,361,370,NTYICMQRT
2439,361,371,NTYICMQRTV
2440,362,370,TYICMQRT
2441,362,371,TYICMQRTV


In [9]:
RAM_use_seqs = sys.getsizeof(digest_pos_df["sequence"])*1e-6

In [10]:
f"seq RAM = {RAM_use_seqs:.5f} Mb, idxes RAM = {RAM_use_idxes:.5f}, ratio = {RAM_use_seqs/RAM_use_idxes:.5f}"

'seq RAM = 0.16621 Mb, idxes RAM = 0.01969, ratio = 8.44230'
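
As a rough back-of-the-envelope check of why indices are cheaper, the sketch below compares the per-peptide cost of two 64-bit integers with the size of the corresponding Python string objects. It assumes CPython, where `sys.getsizeof` includes a fixed per-object overhead; the exact numbers depend on the interpreter version and will not reproduce the pandas measurement above.

```python
import sys
from array import array

# First five peptides and their spans from the digestion table above.
peptides = ["MGAPLLSP", "MGAPLLSPG", "MGAPLLSPGW", "MGAPLLSPGWG", "MGAPLLSPGWGA"]
starts = array("q", [1, 1, 1, 1, 1])      # signed 64-bit integers
stops = array("q", [9, 10, 11, 12, 13])

# Two 8-byte integers per peptide, independent of peptide length.
bytes_idx_per_peptide = starts.itemsize + stops.itemsize

# Average size of the Python string objects (interpreter-dependent).
bytes_str_per_peptide = sum(sys.getsizeof(p) for p in peptides) / len(peptides)

print(bytes_idx_per_peptide, bytes_str_per_peptide)
```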

## 2. Selection of peptide sequences used for library prediction
The `digest_pos_df` contains all unspecifically digested peptide sequences between 8 and 14 aa that can be generated from the concatenated protein sequences. This list is reduced using the `HLA1_Binding_Classifier` from `peptdeep.hla.hla_class1`. Two different model architectures are available, an LSTM model (`HLA_Class_I_LSTM`) and a BERT model (`HLA_Class_I_BERT`). A pretrained model is only available for the LSTM architecture.
The `HLA1_Binding_Classifier` can be used with a pretrained model, tuned with existing peptide data, or trained from scratch. Training a new model should be considered carefully and is not covered in this tutorial.
   

### 2.1 Selection of peptide sequence candidates without transfer learning

Selecting peptide sequences for library prediction with the pretrained model takes only a few steps. First, the classifier is initialized and the pretrained model is loaded. Next, any dataframe containing peptide sequences can be used to predict how likely each sequence is to be an HLA peptide; the only requirement is that the column containing the peptides is called 'sequence'.


In [11]:
from peptdeep.hla.hla_class1 import HLA1_Binding_Classifier

model = HLA1_Binding_Classifier()
model.load_pretrained_hla_model()
manual_prediction = model.predict(digest_pos_df)
manual_prediction


Unnamed: 0,start_pos,stop_pos,sequence,nAA,HLA_prob_pred
0,1,9,MGAPLLSP,8,0.239477
1,145,153,REPRSCAL,8,0.061692
2,146,154,EPRSCALL,8,0.137313
3,155,163,MGWIRGRR,8,0.056462
4,156,164,GWIRGRRS,8,0.001298
...,...,...,...,...,...
2438,112,126,KVSQAAAELQQYCM,14,0.243115
2439,317,331,NGSWQWEDGSILSP,14,0.021114
2440,79,93,DRYRPQPGDEGPGR,14,0.060635
2441,113,127,VSQAAAELQQYCMQ,14,0.355900


Next, we can filter the list based on `HLA_prob_pred`. The higher the probability, the more likely the peptide sequence is to be present in an immunopeptidomics sample. Using a cut-off below 0.7 is not recommended, as this inflates the spectral library; more conservative (higher) cut-offs are preferable.

In [12]:
manual_prediction[manual_prediction['HLA_prob_pred'] > 0.7]

Unnamed: 0,start_pos,stop_pos,sequence,nAA,HLA_prob_pred
17,168,176,EMSEFHNY,8,0.793702
24,130,138,KDALLVGV,8,0.817415
31,137,145,VPAGSNPF,8,0.751329
37,170,178,SEFHNYNL,8,0.940019
67,181,189,KSDFSTRW,8,0.895964
...,...,...,...,...,...
2318,95,109,QSAEEAFLLVATAY,14,0.969541
2378,329,343,SPNLLTIIEMQKGD,14,0.756001
2382,5,19,LLSPGWGAGAAGRR,14,0.733784
2408,110,124,TLKVSQAAAELQQY,14,0.891976
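
The effect of the cut-off on library size can be sketched with a handful of the probabilities printed above. The list below is only a small excerpt of those rows, so the counts are illustrative and not the counts for the full table; `count_above` is a hypothetical helper.

```python
# (sequence, HLA_prob_pred) pairs copied from the outputs above.
scored = [
    ("MGAPLLSP", 0.239477),
    ("REPRSCAL", 0.061692),
    ("EPRSCALL", 0.137313),
    ("EMSEFHNY", 0.793702),
    ("KDALLVGV", 0.817415),
    ("VPAGSNPF", 0.751329),
    ("SEFHNYNL", 0.940019),
    ("KSDFSTRW", 0.895964),
    ("QSAEEAFLLVATAY", 0.969541),
]


def count_above(scored_peptides, cutoff):
    """Number of peptides whose predicted probability exceeds the cut-off."""
    return sum(prob > cutoff for _, prob in scored_peptides)


# A stricter cut-off keeps fewer candidates and therefore a smaller library.
library_sizes = {cutoff: count_above(scored, cutoff) for cutoff in (0.7, 0.8, 0.9)}
print(library_sizes)
```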


As described above, directly using the sequences for classification can be memory-intensive for large lists of sequences. Manual concatenation, unspecific digestion, prediction and filtering are therefore only suggested for small sets of proteins or for the integration of selected sequences (e.g. mutations, nuORFs etc.). For larger inputs, this can be circumvented by predicting and filtering directly from the FASTA-derived protein dataframe using `model.predict_from_proteins()`, which executes the concatenation, unspecific digestion, prediction and filtering automatically in batches. The whole process is thereby more efficient and can be performed without specialized computing infrastructure.

In [13]:
sequences = model.predict_from_proteins(fasta, prob_threshold=0.7)
sequences

100%|██████████| 1/1 [00:00<00:00,  1.20it/s]


Unnamed: 0,start_pos,stop_pos,nAA,HLA_prob_pred,sequence
0,168,176,8,0.793702,EMSEFHNY
1,130,138,8,0.817415,KDALLVGV
2,137,145,8,0.751329,VPAGSNPF
3,170,178,8,0.940019,SEFHNYNL
4,181,189,8,0.895964,KSDFSTRW
...,...,...,...,...,...
143,95,109,14,0.969541,QSAEEAFLLVATAY
144,329,343,14,0.756001,SPNLLTIIEMQKGD
145,5,19,14,0.733784,LLSPGWGAGAAGRR
146,110,124,14,0.891976,TLKVSQAAAELQQY
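
The idea behind batch-wise processing can be sketched in plain Python: digest and score one batch of proteins at a time and keep only the peptides above the threshold, so the full candidate list never has to be held in memory. This is a conceptual sketch and not peptdeep's implementation; `iter_batches`, `filter_in_batches` and the arbitrary placeholder scorer `toy_score` are hypothetical names.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]


def toy_score(peptide):
    """Arbitrary placeholder standing in for a classifier probability."""
    return 0.9 if peptide.endswith("L") else 0.1


def filter_in_batches(proteins, score_fn, threshold, batch_size,
                      min_len=8, max_len=14):
    """Digest, score and filter one batch of proteins at a time."""
    kept = {}
    for batch in iter_batches(proteins, batch_size):
        # Candidates of the current batch only; discarded after filtering.
        candidates = {
            prot[i:i + k]
            for prot in batch
            for k in range(min_len, max_len + 1)
            for i in range(len(prot) - k + 1)
        }
        for pep in candidates:
            score = score_fn(pep)
            if score > threshold:
                kept[pep] = score
    return kept


toy_proteins = ["MABCDEKFGHIJKLMNOPQRST", "FGHIJKLMNOPQR"]
kept_small_batches = filter_in_batches(toy_proteins, toy_score, 0.7, batch_size=1)
kept_one_batch = filter_in_batches(toy_proteins, toy_score, 0.7, batch_size=2)
print(len(kept_small_batches), kept_small_batches == kept_one_batch)
```

The result does not depend on the batch size; only the peak number of candidates held at once changes.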


### 2.2 Selection of peptide sequence candidates with transfer learning

To perform transfer learning, we need a list of peptide sequences that we expect to be present in our sample. These peptides can be retrieved from several sources, such as DDA or directDIA search results. We recommend using at the very least 1000 sequences for transfer learning; the more sequences are available, the better the transfer learning step works. The model performance can be assessed after transfer learning and should be assessed before prediction.

First, the classifier is initialized and the pretrained model is loaded. Next, a protein dataframe is added, in this example the previously loaded FASTA file. The classifier uses the protein dataframe internally to draw negative training data during model training and testing.

In [14]:
model = HLA1_Binding_Classifier()
model.load_pretrained_hla_model()
model.load_proteins(fasta)

Next, we load the peptide sequences used for transfer learning and split them into a training and a testing dataset. This step is very important for assessing the model performance after transfer learning. Here, we use the `digest_pos_df` generated above. As these are not immunopeptides but a list of unspecifically digested proteins, the model performance will not improve, but the principles remain the same.

In [15]:
test_seq_df = digest_pos_df.sample(frac=0.2)
train_seq_df = digest_pos_df.drop(index=test_seq_df.index)
len(train_seq_df), len(test_seq_df)

(1954, 489)
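
The split above is random, so the selected rows change between runs. The following pandas-free sketch shows the same 80/20 logic with a fixed seed; in pandas, the analogous option is the `random_state` argument of `DataFrame.sample`. The function name `split_indices` and the seed value are assumptions made for illustration, and 2443 is simply the total row count implied by the output above.

```python
import random


def split_indices(n_rows, test_frac=0.2, seed=42):
    """Return disjoint (train, test) row indices using a seeded generator."""
    rng = random.Random(seed)
    n_test = round(n_rows * test_frac)
    test_idx = sorted(rng.sample(range(n_rows), n_test))
    test_set = set(test_idx)
    train_idx = [i for i in range(n_rows) if i not in test_set]
    return train_idx, test_idx


train_idx, test_idx = split_indices(2443)
print(len(train_idx), len(test_idx))

# Calling the function again with the same seed selects the same rows.
same_train, same_test = split_indices(2443)
print(test_idx == same_test)
```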

Now, we train the model using the training sequence dataframe. In this example we use 10 training epochs; in a real experiment more should be used. Good starting points are 40 epochs for a training dataset of around 10000 sequences, or 100 epochs for a training dataset of around 1000 sequences. For a real experiment, `warmup_epoch` can be increased to 10.

In [16]:

model.train(train_seq_df,
            epoch=10, warmup_epoch=5, 
            verbose=True)

2024-07-22 09:21:38> Training with fixed sequence length: 0
[Training] Epoch=1, lr=2e-05, loss=1.415909733091082
[Training] Epoch=2, lr=4e-05, loss=1.0947138496807642
[Training] Epoch=3, lr=6e-05, loss=0.8823633790016174
[Training] Epoch=4, lr=8e-05, loss=0.7819523641041347
[Training] Epoch=5, lr=0.0001, loss=0.7255220583506993
[Training] Epoch=6, lr=0.0001, loss=0.705090846334185
[Training] Epoch=7, lr=9.045084971874738e-05, loss=0.7013667055538723
[Training] Epoch=8, lr=6.545084971874738e-05, loss=0.6968921593257359
[Training] Epoch=9, lr=3.4549150281252636e-05, loss=0.6968518495559692
[Training] Epoch=10, lr=9.549150281252633e-06, loss=0.6932548114231655
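
The learning rates in the log are consistent with a linear warmup over the first 5 epochs followed by a cosine decay. The sketch below reproduces the logged values under that assumption; the formula and the peak rate of 1e-4 are inferred from the log and are not taken from the peptdeep source code.

```python
import math


def lr_at_epoch(epoch, base_lr=1e-4, warmup_epoch=5, total_epoch=10):
    """Learning rate for a 1-based epoch: linear warmup, then cosine decay."""
    if epoch <= warmup_epoch:
        return base_lr * epoch / warmup_epoch
    # Fraction of the decay phase completed; 0 in the first epoch after warmup.
    progress = (epoch - warmup_epoch - 1) / (total_epoch - warmup_epoch)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


schedule = [lr_at_epoch(epoch) for epoch in range(1, 11)]
for epoch, lr in enumerate(schedule, start=1):
    print(f"Epoch={epoch}, lr={lr:.6g}")
```

With `warmup_epoch=10`, as suggested for real experiments, the rate would rise more slowly before the decay starts.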




We can assess the model performance after transfer learning using the `model.test()` function on the training and testing data. This can also be done before transfer learning to assess how well the model already fits the available data. The test reports the precision, recall and false positive rate of the model at different probability cut-offs. As a rule of thumb, a false positive rate above roughly 7% is not recommended, because the peptide list becomes disproportionately large, leading to fewer identifications during the search. If the false positive rate is high, the probability cut-off at which peptides are predicted should be increased.

In [17]:
model.test(train_seq_df)

Unnamed: 0,HLA_prob_pred,precision,recall,false_positive
0,0.5,0.4964,0.599795,0.608495
1,0.6,0.622951,0.019447,0.011771
2,0.7,,0.0,0.0
3,0.8,,0.0,0.0
4,0.9,,0.0,0.0


In [18]:
model.test(test_seq_df)

Unnamed: 0,HLA_prob_pred,precision,recall,false_positive
0,0.5,0.480159,0.494888,0.535787
1,0.6,0.461538,0.01227,0.014315
2,0.7,,0.0,0.0
3,0.8,,0.0,0.0
4,0.9,,0.0,0.0
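
The three reported metrics can be sketched as follows, assuming the standard definitions (precision = TP / (TP + FP), recall = TP / positives, false positive rate = FP / negatives). How peptdeep computes them internally is not shown here, and the toy scores are made up for illustration. The sketch also shows why precision is empty (NaN) in the tables above when no peptide passes a cut-off.

```python
import math


def rates_at_cutoff(pos_scores, neg_scores, cutoff):
    """Precision, recall and false positive rate at one probability cut-off."""
    tp = sum(score >= cutoff for score in pos_scores)
    fp = sum(score >= cutoff for score in neg_scores)
    # Precision is undefined when nothing passes the cut-off.
    precision = tp / (tp + fp) if (tp + fp) else math.nan
    recall = tp / len(pos_scores)
    false_positive = fp / len(neg_scores)
    return precision, recall, false_positive


# Hypothetical scores for known peptides (positives) and decoys (negatives).
toy_pos = [0.95, 0.80, 0.65, 0.40]
toy_neg = [0.75, 0.30, 0.20, 0.10]

for cutoff in (0.5, 0.7, 0.99):
    print(cutoff, rates_at_cutoff(toy_pos, toy_neg, cutoff))
```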


After transfer learning and testing the new model, peptides can be predicted in the same way as with the pretrained model.

In [19]:
model.predict_from_proteins(fasta, prob_threshold=0.6)

100%|██████████| 1/1 [00:00<00:00,  1.20it/s]


Unnamed: 0,start_pos,stop_pos,nAA,HLA_prob_pred,sequence
0,143,151,8,0.60663,PFREPRSC
1,170,178,8,0.697908,SEFHNYNL
2,62,70,8,0.602259,KAEIARAY
3,87,95,8,0.611214,DEGPGRTP
4,299,307,8,0.611188,LLKLVKSY
5,346,354,8,0.62016,YASSFKGY
6,344,352,8,0.6017,ALYASSFK
7,223,231,8,0.605099,IMVTIWSA
8,258,266,8,0.618778,ICYKNNCY
9,363,371,8,0.602542,YICMQRTV


## 3. Spectral library prediction

Now the spectral library for the filtered peptide list can be predicted using `PredictSpecLibFasta`. First, the models for RT/CCS/MS2 prediction need to be selected using the `ModelManager`. One can choose from a set of pretrained models or load externally trained models. Here we load the 'HLA' model (at the moment this still loads the generic model, but in the future it is supposed to be replaced by an HLA-specific internal model).

In [20]:
from peptdeep.spec_lib.predict_lib import ModelManager
from peptdeep.protein.fasta import PredictSpecLibFasta

model_mgr = ModelManager()
model_mgr.load_installed_models(model_type='HLA')

In the next step, `PredictSpecLibFasta` is initialized with the preloaded models. The defaults are chosen for the prediction of tryptic libraries, so some parameters need to be adjusted, in particular `precursor_charge_min` and `precursor_charge_max`. By default, Carbamidomethylation is set as a fixed modification (`fix_mods`), and Acetylation and Oxidation are set as variable modifications (`var_mods`). These can be removed by passing an empty list, as shown here for the fixed modifications.

Of note, `PredictSpecLibFasta` can also be used to predict a library directly from a FASTA file. For that purpose, one can also set the protease (default trypsin) and the minimum and maximum peptide length (7 to 35). We do not need to change those parameters here, as we will not use the digestion functions but instead provide an already digested sequence table.


In [21]:
speclib = PredictSpecLibFasta(model_manager=model_mgr,
                              precursor_charge_min=1,
                              precursor_charge_max=3,
                              fix_mods=[])

To reduce the size of the dataframe and the predicted library, we give each peptide sequence a unique protein identifier (a number). This enables the use of search engines that rely on protein information (such as AlphaDIA), but one needs to remember to disable filtering steps based on how many peptides per protein are identified during data analysis. Alternatively, the proteins that the peptide sequences could originate from can be inferred using `annotate_precursor_df` (demonstrated in section 4).

In [22]:
sequences['protein_id'] = [str(i) for i in range(len(sequences))]
sequences['protein_idxes'] = sequences.protein_id.astype("U")
sequences['full_name'] = sequences['protein_id'] 
sequences['gene_org'] = sequences['protein_id'] 
sequences['gene_name'] = sequences['protein_id']
sequences["is_prot_nterm"] = False
sequences["is_prot_cterm"] = False
sequences

Unnamed: 0,start_pos,stop_pos,nAA,HLA_prob_pred,sequence,protein_id,protein_idxes,full_name,gene_org,gene_name,is_prot_nterm,is_prot_cterm
0,168,176,8,0.793702,EMSEFHNY,0,0,0,0,0,False,False
1,130,138,8,0.817415,KDALLVGV,1,1,1,1,1,False,False
2,137,145,8,0.751329,VPAGSNPF,2,2,2,2,2,False,False
3,170,178,8,0.940019,SEFHNYNL,3,3,3,3,3,False,False
4,181,189,8,0.895964,KSDFSTRW,4,4,4,4,4,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...
143,95,109,14,0.969541,QSAEEAFLLVATAY,143,143,143,143,143,False,False
144,329,343,14,0.756001,SPNLLTIIEMQKGD,144,144,144,144,144,False,False
145,5,19,14,0.733784,LLSPGWGAGAAGRR,145,145,145,145,145,False,False
146,110,124,14,0.891976,TLKVSQAAAELQQY,146,146,146,146,146,False,False


The sequence dataframe contains all the relevant information to be passed to the protein_df and the precursor_df.

In [23]:
speclib.protein_df = sequences[["sequence","protein_id","nAA", 'full_name', 'gene_org', 'gene_name']].copy()
speclib.protein_df

Unnamed: 0,sequence,protein_id,nAA,full_name,gene_org,gene_name
0,EMSEFHNY,0,8,0,0,0
1,KDALLVGV,1,8,1,1,1
2,VPAGSNPF,2,8,2,2,2
3,SEFHNYNL,3,8,3,3,3
4,KSDFSTRW,4,8,4,4,4
...,...,...,...,...,...,...
143,QSAEEAFLLVATAY,143,14,143,143,143
144,SPNLLTIIEMQKGD,144,14,144,144,144
145,LLSPGWGAGAAGRR,145,14,145,145,145
146,TLKVSQAAAELQQY,146,14,146,146,146


In [24]:
speclib.precursor_df = sequences[["sequence","protein_idxes","start_pos","stop_pos","nAA","HLA_prob_pred", 'is_prot_nterm', 'is_prot_cterm']].copy()
speclib.precursor_df

Unnamed: 0,sequence,protein_idxes,start_pos,stop_pos,nAA,HLA_prob_pred,is_prot_nterm,is_prot_cterm
0,EMSEFHNY,0,168,176,8,0.793702,False,False
1,KDALLVGV,1,130,138,8,0.817415,False,False
2,VPAGSNPF,2,137,145,8,0.751329,False,False
3,SEFHNYNL,3,170,178,8,0.940019,False,False
4,KSDFSTRW,4,181,189,8,0.895964,False,False
...,...,...,...,...,...,...,...,...
143,QSAEEAFLLVATAY,143,95,109,14,0.969541,False,False
144,SPNLLTIIEMQKGD,144,329,343,14,0.756001,False,False
145,LLSPGWGAGAAGRR,145,5,19,14,0.733784,False,False
146,TLKVSQAAAELQQY,146,110,124,14,0.891976,False,False



Next, the modifications and charges can be added to the peptide dataframe using add_modifications and add_charge. This creates a unique entry for every combination of charge and modification for all the sequences in the precursor dataframe. 

In [26]:
speclib.add_modifications()
speclib.add_charge()
speclib.precursor_df

Unnamed: 0,sequence,protein_idxes,start_pos,stop_pos,nAA,HLA_prob_pred,is_prot_nterm,is_prot_cterm,mods,mod_sites,charge
0,EMSEFHNY,0,168,176,8,0.793702,False,False,Oxidation@M,2,1
1,EMSEFHNY,0,168,176,8,0.793702,False,False,Oxidation@M,2,2
2,EMSEFHNY,0,168,176,8,0.793702,False,False,Oxidation@M,2,3
3,EMSEFHNY,0,168,176,8,0.793702,False,False,,,1
4,EMSEFHNY,0,168,176,8,0.793702,False,False,,,2
...,...,...,...,...,...,...,...,...,...,...,...
493,TLKVSQAAAELQQY,146,110,124,14,0.891976,False,False,,,2
494,TLKVSQAAAELQQY,146,110,124,14,0.891976,False,False,,,3
495,LSPGWGAGAAGRRW,147,6,20,14,0.842583,False,False,,,1
496,LSPGWGAGAAGRRW,147,6,20,14,0.842583,False,False,,,2
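
How the table grows can be sketched with a simplified expansion that considers only variable oxidation of methionine and a range of charge states. The 1-based `mod_sites` and the `Oxidation@M` label follow the output above; joining several modifications with `;`, the `max_var_mods` limit and the function name are assumptions of this sketch. Protein N-terminal acetylation is ignored because `is_prot_nterm` is False for all peptides here.

```python
from itertools import combinations, product


def expand_precursors(sequence, charges=(1, 2, 3), max_var_mods=2):
    """List (sequence, mods, mod_sites, charge) for all Met-oxidation variants."""
    # 1-based positions of methionine residues.
    m_sites = [i + 1 for i, aa in enumerate(sequence) if aa == "M"]
    variants = []
    for n_mods in range(0, min(max_var_mods, len(m_sites)) + 1):
        for sites in combinations(m_sites, n_mods):
            mods = ";".join("Oxidation@M" for _ in sites)
            mod_sites = ";".join(str(site) for site in sites)
            variants.append((mods, mod_sites))
    # Every modification variant is combined with every charge state.
    return [
        (sequence, mods, mod_sites, charge)
        for (mods, mod_sites), charge in product(variants, charges)
    ]


rows = expand_precursors("EMSEFHNY")
for row in rows:
    print(row)
```

A peptide with one methionine and three charge states therefore yields six precursor entries, while a peptide without methionine yields three.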


Now CCS, RT and MS2 can be predicted for each entry.

In [27]:
speclib.predict_all()

2024-07-22 09:22:23> Predicting RT/IM/MS2 for 400 precursors ...
2024-07-22 09:22:23> Predicting RT ...
100%|██████████| 7/7 [00:00<00:00, 69.31it/s]
2024-07-22 09:22:23> Predicting mobility ...
100%|██████████| 7/7 [00:00<00:00, 72.89it/s]
2024-07-22 09:22:23> Predicting MS2 ...
100%|██████████| 7/7 [00:00<00:00, 22.52it/s]
2024-07-22 09:22:24> End predicting RT/IM/MS2

iRTs can be added using `translate_rt_to_irt_pred`. This is not necessary for search engines like DIA-NN or AlphaDIA but is required for Spectronaut.

In [28]:
speclib.translate_rt_to_irt_pred()

Predict RT for 11 iRT precursors.
Linear regression of `rt_pred` to `irt`:
   R_square         R       slope  intercept  test_num
0   0.99007  0.995022  152.235621  -39.23216        11


Unnamed: 0,sequence,protein_idxes,start_pos,stop_pos,nAA,HLA_prob_pred,is_prot_nterm,is_prot_cterm,mods,mod_sites,...,precursor_mz,rt_pred,rt_norm_pred,ccs_pred,mobility_pred,nce,instrument,frag_start_idx,frag_stop_idx,irt_pred
0,EMSEFHNY,0,168,176,8,0.793702,False,False,Oxidation@M,2,...,1072.404037,0.189650,0.189650,254.195923,1.253140,30.0,Lumos,0,7,-10.360729
1,EMSEFHNY,0,168,176,8,0.793702,False,False,Oxidation@M,2,...,536.705657,0.189650,0.189650,337.328583,0.831494,30.0,Lumos,7,14,-10.360729
2,EMSEFHNY,0,168,176,8,0.793702,False,False,,,...,1056.409123,0.289261,0.289261,255.103760,1.257373,30.0,Lumos,14,21,4.803679
3,EMSEFHNY,0,168,176,8,0.793702,False,False,,,...,528.708200,0.289261,0.289261,337.444641,0.831621,30.0,Lumos,21,28,4.803679
4,KDALLVGV,1,130,138,8,0.817415,False,False,,,...,814.503280,0.433791,0.433791,256.615234,1.260001,30.0,Lumos,28,35,26.806266
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
395,TLKVSQAAAELQQY,146,110,124,14,0.891976,False,False,,,...,775.414662,0.489545,0.489545,429.360870,1.062514,30.0,Lumos,3810,3823,35.294021
396,TLKVSQAAAELQQY,146,110,124,14,0.891976,False,False,,,...,517.278867,0.489545,0.489545,463.231110,0.764225,30.0,Lumos,3823,3836,35.294021
397,LSPGWGAGAAGRRW,147,6,20,14,0.842583,False,False,,,...,1441.744742,0.377743,0.377743,289.200989,1.430378,30.0,Lumos,3836,3849,18.273781
398,LSPGWGAGAAGRRW,147,6,20,14,0.842583,False,False,,,...,721.376009,0.377743,0.377743,404.633667,1.000659,30.0,Lumos,3849,3862,18.273781
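
The `irt_pred` column is a linear transformation of `rt_pred` using the slope and intercept of the regression printed above. A minimal check with the values from this output (small deviations are expected because the displayed numbers are rounded; `rt_to_irt` is a hypothetical helper name):

```python
# Regression parameters copied from the output above.
SLOPE = 152.235621
INTERCEPT = -39.23216


def rt_to_irt(rt_pred, slope=SLOPE, intercept=INTERCEPT):
    """Map a predicted normalized RT onto the iRT scale."""
    return slope * rt_pred + intercept


# rt_pred values of the oxidized and unmodified EMSEFHNY precursors.
irt_oxidized = rt_to_irt(0.189650)
irt_unmodified = rt_to_irt(0.289261)
print(round(irt_oxidized, 3), round(irt_unmodified, 3))
```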


Now the predicted library can be exported in HDF format (for AlphaDIA) or translated to a TSV file. The TSV translation can be very time-consuming. Before the spectral library can be translated, the gene and protein columns need to be mapped from the `protein_df` into the `precursor_df`.

In [29]:
# Forward slashes avoid accidental escape sequences in Windows paths.
hdf_path = "D:/Software/FASTA/Human/speclib_example.hdf"

speclib.save_hdf(hdf_path)

In [30]:
from peptdeep.spec_lib.translate import translate_to_tsv
speclib.append_protein_name()
translate_to_tsv(speclib=speclib,
                 tsv="D:/Software/FASTA/Human/speclib_example.tsv")

100%|██████████| 1/1 [00:01<00:00,  1.51s/it]


Translation finished, it will take several minutes to export the rest precursors to the tsv file...


## 4. Matching peptides back to proteins

The peptide sequences can be matched back to proteins using `annotate_precursor_df`, which requires a 'sequence' column and a protein dataframe such as the previously loaded FASTA file. This can be done with the sequence output of any search engine or before the library is generated.

In [31]:
from alphabase.protein.fasta import annotate_precursor_df
inferred_sequences = annotate_precursor_df(sequences, fasta)
inferred_sequences

100%|██████████| 2/2 [00:00<?, ?it/s]


Unnamed: 0,start_pos,stop_pos,nAA,HLA_prob_pred,sequence,protein_id,protein_idxes,full_name,gene_org,gene_name,is_prot_nterm,is_prot_cterm,genes,proteins,cardinality
0,168,176,8,0.793702,EMSEFHNY,0,0,0,0,0,False,False,A0A024RAP8_HUMAN,A0A024RAP8,1
1,130,138,8,0.817415,KDALLVGV,1,1,1,1,1,False,False,A0A024R161_HUMAN,A0A024R161,1
2,137,145,8,0.751329,VPAGSNPF,2,2,2,2,2,False,False,A0A024R161_HUMAN,A0A024R161,1
3,170,178,8,0.940019,SEFHNYNL,3,3,3,3,3,False,False,A0A024RAP8_HUMAN,A0A024RAP8,1
4,181,189,8,0.895964,KSDFSTRW,4,4,4,4,4,False,False,A0A024RAP8_HUMAN,A0A024RAP8,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
143,95,109,14,0.969541,QSAEEAFLLVATAY,143,143,143,143,143,False,False,A0A024R161_HUMAN,A0A024R161,1
144,329,343,14,0.756001,SPNLLTIIEMQKGD,144,144,144,144,144,False,False,A0A024RAP8_HUMAN,A0A024RAP8,1
145,5,19,14,0.733784,LLSPGWGAGAAGRR,145,145,145,145,145,False,False,A0A024R161_HUMAN,A0A024R161,1
146,110,124,14,0.891976,TLKVSQAAAELQQY,146,146,146,146,146,False,False,A0A024R161_HUMAN,A0A024R161,1
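
Conceptually, the annotation looks up which proteins contain each peptide and counts them (the `cardinality` column). The naive substring search below illustrates this on the toy proteins from section 1.1; it is not how `annotate_precursor_df` is implemented, and the protein IDs `P1` and `P2` as well as the function name are hypothetical.

```python
def annotate_naive(peptides, proteins):
    """Map each peptide to the IDs of all proteins that contain it."""
    rows = []
    for pep in peptides:
        hits = [prot_id for prot_id, seq in proteins.items() if pep in seq]
        rows.append({
            "sequence": pep,
            "proteins": ";".join(hits),
            "cardinality": len(hits),  # number of matching proteins
        })
    return rows


toy_protein_dict = {
    "P1": "MABCDEKFGHIJKLMNOPQRST",
    "P2": "FGHIJKLMNOPQR",
}
annotated = annotate_naive(["MABCDEKF", "FGHIJKLM"], toy_protein_dict)
for row in annotated:
    print(row)
```

A peptide shared by several proteins, such as `FGHIJKLM` here, gets a cardinality above 1.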
