# Using peptdeep for MHC class I immunopeptidomics

Note that the `pydivsufsort` package is not installed with peptdeep by default. Install it with:
```
pip install "peptdeep[development,hla]"
```

Or install it from within a Jupyter notebook:

In [1]:
%pip install -q pydivsufsort

Note: you may need to restart the kernel to use updated packages.


## Unspecific digestion in alphabase

The longest common prefix (LCP) algorithm, which is built on the suffix array data structure, has been shown to be very efficient for unspecific digestion [https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-11-577]. Here we use `pydivsufsort`, a Python wrapper of the high-performance C library libdivsufsort [https://github.com/y-256/libdivsufsort], to facilitate LCP-based digestion.
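To make the idea concrete, here is a minimal pure-Python sketch of LCP-based enumeration of non-redundant substrings. It is not how alphabase or libdivsufsort implement it: the suffix array is built by naive sorting, and the function name `distinct_substrings` is hypothetical. The key observation is that, in sorted suffix order, any prefix no longer than the LCP with the previous suffix has already been reported, so only longer prefixes are new.

```python
def distinct_substrings(text, min_len, max_len, stop_char="$"):
    """Return (start, stop) pairs of distinct substrings without stop_char.

    Illustrative only: O(n^2 log n) suffix sorting, fine for toy inputs.
    """
    n = len(text)
    # Naive suffix array: suffix start positions sorted by suffix text.
    suffix_array = sorted(range(n), key=lambda i: text[i:])
    pairs = []
    prev = None
    for start in suffix_array:
        # Length of the common prefix shared with the previous sorted suffix.
        lcp = 0
        if prev is not None:
            while (
                start + lcp < n
                and prev + lcp < n
                and text[start + lcp] == text[prev + lcp]
            ):
                lcp += 1
        prev = start
        # A substring may not run past the next sentinel character.
        stop = text.find(stop_char, start)
        if stop == -1:
            stop = n
        longest = min(max_len, stop - start)
        # Prefixes of length <= lcp were already emitted by an earlier suffix.
        for length in range(max(min_len, lcp + 1), longest + 1):
            pairs.append((start, start + length))
    return pairs


toy = "$ABCAB$BCA$"
pairs = distinct_substrings(toy, min_len=2, max_len=3)
peptides = sorted(toy[a:b] for a, b in pairs)
print(peptides)  # ['AB', 'ABC', 'BC', 'BCA', 'CA', 'CAB']
```

Each substring is reported once even though `AB`, `BC`, `CA` and `BCA` occur more than once in the toy input.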

Unspecific digestion in alphabase involves two steps:

1. Concatenate protein sequences into a single sequence, separated by a sentinel character, e.g., '$'. For instance:

In [2]:
def concat_sequences_for_nonspecific_digestion(seq_list, sep="$"):
    return sep + sep.join(seq_list) + sep

In [3]:
prot_seq_list = ["MABCDEKFGHIJKLMNOPQRST","FGHIJKLMNOPQR"]
cat_prot = concat_sequences_for_nonspecific_digestion(prot_seq_list, sep="$")
cat_prot

'$MABCDEKFGHIJKLMNOPQRST$FGHIJKLMNOPQR$'

Note that the first and last sentinel characters are crucial as well.

2. Use `alphabase.protein.lcp_digest.get_substring_indices` to get all non-redundant non-specific sequences from the concatenated sequence.

In [4]:
from alphabase.protein.lcp_digest import get_substring_indices
import pandas as pd

start_idxes, stop_idxes = get_substring_indices(
    cat_prot, min_len=8, max_len=14, stop_char="$"
)
digest_pos_df = pd.DataFrame({
    "start_pos": start_idxes,
    "stop_pos": stop_idxes,
})
digest_pos_df

| | start_pos | stop_pos |
|---|---|---|
| 0 | 1 | 9 |
| 1 | 1 | 10 |
| 2 | 1 | 11 |
| 3 | 1 | 12 |
| 4 | 1 | 13 |
| ... | ... | ... |
| 79 | 13 | 22 |
| 80 | 13 | 23 |
| 81 | 14 | 22 |
| 82 | 14 | 23 |


All unspecific peptides can be located by the `start_pos` and `stop_pos` columns in `digest_pos_df`, and the LCP algorithm guarantees that the peptides are non-redundant.
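As a small illustration of how a position pair maps back to a peptide, and to the protein it was cut from, the sketch below slices the concatenated string and looks up the preceding sentinel with `bisect`. The helper name `locate` is hypothetical and not part of alphabase; the pair `(24, 32)` is chosen by hand for illustration.

```python
from bisect import bisect_right

cat_prot = "$MABCDEKFGHIJKLMNOPQRST$FGHIJKLMNOPQR$"

# Positions of all sentinel characters; protein k lies between
# sentinel k and sentinel k + 1.
sentinel_positions = [i for i, ch in enumerate(cat_prot) if ch == "$"]


def locate(start, stop):
    """Return (peptide, protein index) for a half-open [start, stop) pair."""
    peptide = cat_prot[start:stop]
    protein_idx = bisect_right(sentinel_positions, start) - 1
    return peptide, protein_idx


print(locate(1, 9))    # ('MABCDEKF', 0)
print(locate(24, 32))  # ('FGHIJKLM', 1)
```

Because redundant peptides are reported only once, a lookup like this recovers a single source protein per position pair, even when the same peptide occurs in several proteins.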

In [39]:
import random
import string
random.seed(0)
cat_seq = '$'+''.join(random.choices(string.ascii_uppercase+'$', k=10000))+'$'
start_idxes, stop_idxes = get_substring_indices(cat_seq, min_len=7, max_len=14)
digest_pos_df = pd.DataFrame({
    "start_pos": start_idxes,
    "stop_pos": stop_idxes,
})
digest_pos_df

| | start_pos | stop_pos |
|---|---|---|
| 0 | 1 | 8 |
| 1 | 1 | 9 |
| 2 | 1 | 10 |
| 3 | 1 | 11 |
| 4 | 1 | 12 |
| ... | ... | ... |
| 54935 | 9987 | 9995 |
| 54936 | 9987 | 9996 |
| 54937 | 9988 | 9995 |
| 54938 | 9988 | 9996 |


In [40]:
import sys
RAM_use_idxes = sys.getsizeof(digest_pos_df)*1e-6

In [41]:
digest_pos_df["sequence"] = digest_pos_df[
    ["start_pos","stop_pos"]
].apply(lambda x: cat_seq[slice(*x)], axis=1)
digest_pos_df

| | start_pos | stop_pos | sequence |
|---|---|---|---|
| 0 | 1 | 8 | WULGNKV |
| 1 | 1 | 9 | WULGNKVI |
| 2 | 1 | 10 | WULGNKVIM |
| 3 | 1 | 11 | WULGNKVIMP |
| 4 | 1 | 12 | WULGNKVIMPY |
| ... | ... | ... | ... |
| 54935 | 9987 | 9995 | CESHBWDD |
| 54936 | 9987 | 9996 | CESHBWDDX |
| 54937 | 9988 | 9995 | ESHBWDD |
| 54938 | 9988 | 9996 | ESHBWDDX |


In [42]:
RAM_use_seqs = sys.getsizeof(digest_pos_df["sequence"])*1e-6

In [43]:
f"idxes RAM = {RAM_use_idxes:.5f} MB, seq RAM = {RAM_use_seqs:.5f} MB, ratio = {RAM_use_seqs/RAM_use_idxes:.5f}"

'idxes RAM = 0.43968 MB, seq RAM = 3.25833 MB, ratio = 7.41063'

To save RAM, the `peptdeep.hla` module works on start and stop indices rather than on peptide sequences directly. For HLA-I peptides (length from 8 to 14), this needs roughly 7 to 8 times less RAM than storing the sequences (the ratio is about 7.4 in the example above). A very large protein sequence database yields millions of unspecific peptides, so working with strings is sometimes infeasible because of the RAM it would require.
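One way to work from indices is to materialize sequences only in small batches, so that at any moment just one batch of strings is held in memory. The generator below is a minimal sketch of that pattern under this assumption; `iter_peptide_batches` is a hypothetical helper and does not describe how `peptdeep.hla` is implemented.

```python
def iter_peptide_batches(cat_seq, start_idxes, stop_idxes, batch_size=1000):
    """Yield lists of peptide strings, one batch of index pairs at a time."""
    for offset in range(0, len(start_idxes), batch_size):
        starts = start_idxes[offset:offset + batch_size]
        stops = stop_idxes[offset:offset + batch_size]
        # Strings exist only for the current batch.
        yield [cat_seq[a:b] for a, b in zip(starts, stops)]


cat_prot = "$MABCDEKFGHIJKLMNOPQRST$FGHIJKLMNOPQR$"
# A few position pairs taken from the toy digest table.
starts = [1, 1, 1, 14, 14]
stops = [9, 10, 11, 22, 23]

batches = list(iter_peptide_batches(cat_prot, starts, stops, batch_size=2))
for batch in batches:
    print(batch)
```

In a real workflow each batch would be passed to the model and discarded before the next one is built, rather than collected into a list as done here for display.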

## Transfer learning for HLA class I prediction with `peptdeep.hla`

In [44]:
from peptdeep.hla.hla_class1 import HLA1_Binding_Classifier

model = HLA1_Binding_Classifier()
model.load_pretrained_hla_model()