<a href="https://colab.research.google.com/github/jalew188/PeptDeep-HLA/blob/master/nbs/HLA1_transfer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Transfer learning of sample-specific HLA-I models

### Cite us

Wen-Feng Zeng, Xie-Xuan Zhou, Sander Willems, Constantin Ammar, Maria Wahle, Isabell Bludau, Eugenia Voytik, Maximillian T. Strauss & Matthias Mann. AlphaPeptDeep: a modular deep learning framework to predict peptide properties for proteomics. Nat Commun 13, 7238 (2022). https://doi.org/10.1038/s41467-022-34904-3

> To enable a GPU in Colab, click `Runtime -> Change runtime type` and select a GPU as the hardware accelerator.
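
After switching the runtime, it can help to confirm that a GPU is actually visible before training. The following is a minimal sketch: it asks PyTorch if it happens to be installed and otherwise falls back to looking for the `nvidia-smi` tool, which is only a heuristic. The helper name `gpu_runtime_status` is made up for this example and is not part of PeptDeep-HLA.

```python
import importlib.util
import shutil


def gpu_runtime_status():
    """Return 'cuda' if a CUDA GPU looks usable, otherwise 'cpu'.

    Hypothetical helper for illustration; not part of PeptDeep-HLA.
    """
    if importlib.util.find_spec("torch") is not None:
        import torch  # imported lazily so the check also works without PyTorch

        return "cuda" if torch.cuda.is_available() else "cpu"
    # Heuristic fallback: the NVIDIA driver tool is normally found only on GPU machines.
    return "cuda" if shutil.which("nvidia-smi") else "cpu"


print(gpu_runtime_status())
```

If this prints `cpu` on Colab, the runtime type has probably not been changed yet.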

In [1]:
%pip install -q git+https://github.com/MannLabs/PeptDeep-HLA.git

  Preparing metadata (setup.py) ... done


In [2]:
#@title Upload your fasta files for HLA peptide prediction
from google.colab import files
uploaded_fasta = files.upload()

Saving human.fasta to human.fasta


In [3]:
fasta_files = list(uploaded_fasta.keys())

#### Upload training HLA peptides

The file should be a tab-separated table (`.tsv` or `.txt`) with the sample-specific HLA-I peptides in a column named `sequence`.
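
Before uploading, it may be worth checking that the file really has a `sequence` column and contains only the 20 standard amino acids. The sketch below uses only the standard library so that it can also run outside Colab. The function name `read_hla_sequences`, the 8 to 14 residue length window, and the toy rows are assumptions for illustration, not requirements stated by PeptDeep-HLA.

```python
import csv
import io

STANDARD_AAS = set("ACDEFGHIKLMNPQRSTVWY")


def read_hla_sequences(handle, min_len=8, max_len=14):
    """Read unique, valid peptides from a tab-separated file with a 'sequence' column.

    min_len/max_len are illustrative defaults, not PeptDeep-HLA requirements.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    if reader.fieldnames is None or "sequence" not in reader.fieldnames:
        raise ValueError("the file needs a 'sequence' column")
    kept, seen = [], set()
    for row in reader:
        seq = (row["sequence"] or "").strip()
        if seq in seen or not (min_len <= len(seq) <= max_len):
            continue  # skip duplicates and peptides outside the length window
        if set(seq) <= STANDARD_AAS:  # drop entries with non-standard characters
            kept.append(seq)
            seen.add(seq)
    return kept


# Toy input: two valid peptides, one duplicate, one entry with invalid characters.
example = io.StringIO("sequence\nNRNDQEATL\nEHVKEVQQL\nNRNDQEATL\nPEPTIDEX1\n")
print(read_hla_sequences(example))
```

For a real file, the same function could be called as `read_hla_sequences(open(path, newline=""))`, where `path` is the local file name.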

In [4]:
#@title Upload your HLA peptide file for transfer learning
HLA_sequence_file = files.upload()

Saving HLA_sequences_DIA-Umpire_RA957.tsv to HLA_sequences_DIA-Umpire_RA957.tsv


In [5]:
import pandas as pd

hla_seq_df = pd.read_table(list(HLA_sequence_file.keys())[0])
hla_seq_df['nAA'] = hla_seq_df.sequence.str.len()
hla_seq_df

Unnamed: 0,sequence,nAA
0,ETQGQQPPQR,10
1,ESAPEGQAQQR,11
2,NRNDQEATL,9
3,EHVKEVQQL,9
4,NHHLQETSF,9
...,...,...
15982,EHMELVSRL,9
15983,VTPQIDSSRI,10
15984,MPVSELTDKL,10
15985,IPISHIDDVL,10


In [6]:
test_seq_df = hla_seq_df.sample(frac=0.2)
train_seq_df = hla_seq_df.drop(index=test_seq_df.index)
len(train_seq_df), len(test_seq_df)

(12790, 3197)

#### Initialize the model

In [7]:
from peptdeep_hla.HLA_class_I import HLA_Class_I_Classifier
model = HLA_Class_I_Classifier(
    fasta_files=fasta_files
)
model.get_parameter_num()

1669697

#### Load the pretrained model

In [8]:
from peptdeep_hla.HLA_class_I import pretrained_HLA1
model.load(pretrained_HLA1)
pretrained_HLA1

'/usr/local/lib/python3.8/dist-packages/peptdeep_hla/pretrained_models/HLA1_IEDB.pt'

#### Transfer learning with the training peptides

Non-HLA peptides are sampled automatically from the uploaded fasta file(s) and serve as the negative training data.
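
To make the idea of sampled negatives concrete, here is a minimal sketch that draws random fixed-length windows from protein sequences while skipping known positives. It only illustrates the principle and is not the implementation used by PeptDeep-HLA; `sample_negative_peptides`, its parameters, and the toy protein are hypothetical.

```python
import random


def sample_negative_peptides(proteins, n, pep_len, exclude=(), seed=None):
    """Draw up to n random windows of length pep_len, avoiding known positives.

    Hypothetical helper; PeptDeep-HLA may sample negatives differently.
    """
    rng = random.Random(seed)
    exclude = set(exclude)
    candidates = [p for p in proteins if len(p) >= pep_len]
    if not candidates:
        raise ValueError("no protein is long enough for this peptide length")
    negatives = []
    for _ in range(100 * n):  # bounded number of attempts
        if len(negatives) == n:
            break
        protein = rng.choice(candidates)
        start = rng.randrange(len(protein) - pep_len + 1)
        peptide = protein[start:start + pep_len]
        if peptide not in exclude:
            negatives.append(peptide)
    return negatives


# Made-up protein sequence, used only to show the call.
toy_protein = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
negs = sample_negative_peptides(
    [toy_protein], n=3, pep_len=9, exclude={"MKTAYIAKQ"}, seed=0
)
print(negs)
```

Note that this sketch picks proteins uniformly, so windows from short proteins are over-represented; a real sampler might weight by protein length instead.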

In [9]:
model.train(
    train_seq_df, 
    epoch=40, warmup_epoch=10, 
    verbose=True
)

Training with padding zero sequences: True
[Training] Epoch=1, lr=1e-05, loss=0.3019891753792763
[Training] Epoch=2, lr=2e-05, loss=0.28134927060455084
[Training] Epoch=3, lr=3e-05, loss=0.2536478629335761
[Training] Epoch=4, lr=4e-05, loss=0.23166388645768166
[Training] Epoch=5, lr=5e-05, loss=0.20467489026486874
[Training] Epoch=6, lr=6e-05, loss=0.19972726609557867
[Training] Epoch=7, lr=7e-05, loss=0.1796502536162734
[Training] Epoch=8, lr=8e-05, loss=0.168984220828861
[Training] Epoch=9, lr=9e-05, loss=0.1640731580555439
[Training] Epoch=10, lr=0.0001, loss=0.1627729870378971
[Training] Epoch=11, lr=9.972609476841367e-05, loss=0.15256229555234313
[Training] Epoch=12, lr=9.890738003669029e-05, loss=0.14151700539514422
[Training] Epoch=13, lr=9.755282581475769e-05, loss=0.13732863403856754
[Training] Epoch=14, lr=9.567727288213005e-05, loss=0.12972815288230777
[Training] Epoch=15, lr=9.330127018922194e-05, loss=0.12897195341065526
[Training] Epoch=16, lr=9.045084971874738e-05, loss=
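
The logged learning rates are consistent with a linear warm-up to 1e-4 over the first 10 epochs, followed by a half-cosine decay over the remaining 30 epochs. The sketch below reproduces the logged values under that assumption. The formula is inferred from the log above rather than taken from the PeptDeep-HLA source, and `lr_at_epoch` is a hypothetical name.

```python
import math


def lr_at_epoch(epoch, peak_lr=1e-4, warmup_epoch=10, total_epoch=40):
    """Learning rate for a 1-based epoch: linear warm-up, then half-cosine decay.

    Inferred from the training log; not copied from the library.
    """
    if epoch <= warmup_epoch:
        return peak_lr * epoch / warmup_epoch
    progress = (epoch - warmup_epoch) / (total_epoch - warmup_epoch)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Compare these with the lr values printed for the same epochs in the log.
for epoch in (1, 10, 11, 15, 16):
    print(epoch, lr_at_epoch(epoch))
```

Under this schedule the learning rate reaches zero at the final epoch, so most of the adaptation to the sample happens in the middle of the run.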

#### Testing

In [10]:
from peptdeep_hla.utils import get_random_sequences

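def concat_neg_df(pos_df, prot_df, column_to_train='HLA'):
    """Label pos_df as positives and append length-matched random negatives."""
    pos_df = pos_df.copy()  # avoid modifying the caller's DataFrame
    pos_df[column_to_train] = 1
    df_list = [pos_df]
    # For every peptide length, draw as many random sequences as there are positives.
    for nAA, group_df in pos_df.groupby('nAA'):
        rnd_seqs = get_random_sequences(
            prot_df,
            n=len(group_df),
            pep_len=nAA
        )
        df_list.append(pd.DataFrame(
            {'sequence': rnd_seqs, 'nAA': nAA, column_to_train: 0}
        ))
    return pd.concat(df_list).reset_index(drop=True)

def test(df):
    """Tabulate precision, recall and false-positive rate at several cutoffs."""
    df = concat_neg_df(df, model.protein_df)
    model.predict(df)  # the code below reads the resulting 'HLA_prob_pred' column
    prob_list = []
    precision_list = []
    recall_list = []
    fp_list = []
    # concat_neg_df requests one negative per positive, so each class
    # is expected to contain len(df)/2 rows.
    for prob in [0.5, 0.6, 0.7, 0.8, 0.9]:
        prob_list.append(prob)
        # Fraction of peptides above the cutoff that are true HLA peptides.
        precision_list.append(df[df.HLA_prob_pred>prob].HLA.mean())
        # Fraction of all positives that score above the cutoff.
        recall_list.append(df[df.HLA_prob_pred>prob].HLA.sum()/len(df)*2)
        # One minus the fraction of negatives that score below the cutoff.
        fp_list.append(1-(1-df[df.HLA_prob_pred<prob].HLA).sum()/len(df)*2)
    return pd.DataFrame(dict(
        HLA_prob_pred=prob_list,
        precision=precision_list,
        recall=recall_list,
        false_positive=fp_list
    ))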
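
Because `concat_neg_df` requests one random negative per positive, dividing by `len(df)` and multiplying by 2 in `test` amounts to dividing by the number of positives (or negatives). The toy example below spells out the three quantities on a small balanced set with made-up labels and probabilities. `metrics_at` is a hypothetical helper; it counts false positives directly, which matches the `test` function as long as no prediction falls exactly on the threshold.

```python
def metrics_at(labels, probs, threshold):
    """Precision, recall and false-positive rate when calling probs > threshold."""
    called = [y for y, p in zip(labels, probs) if p > threshold]
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    true_pos = sum(called)
    false_pos = len(called) - true_pos
    precision = true_pos / len(called) if called else float("nan")
    return precision, true_pos / n_pos, false_pos / n_neg


# Made-up balanced toy data: 4 positives (1) and 4 negatives (0).
labels = [1, 1, 1, 1, 0, 0, 0, 0]
probs = [0.95, 0.85, 0.75, 0.40, 0.65, 0.30, 0.20, 0.10]
for threshold in (0.5, 0.7):
    print(threshold, metrics_at(labels, probs, threshold))
```

Raising the threshold from 0.5 to 0.7 removes the single false positive in this toy set, which is the same precision-for-recall trade-off visible in the result tables.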

In [11]:
test(train_seq_df)

Unnamed: 0,HLA_prob_pred,precision,recall,false_positive
0,0.5,0.958505,0.984285,0.042611
1,0.6,0.962818,0.979906,0.037842
2,0.7,0.968011,0.970055,0.032056
3,0.8,0.974521,0.953948,0.024941
4,0.9,0.98111,0.909617,0.017514


In [12]:
test(test_seq_df)

Unnamed: 0,HLA_prob_pred,precision,recall,false_positive
0,0.5,0.960871,0.952455,0.038786
1,0.6,0.965837,0.9462,0.033469
2,0.7,0.97077,0.934939,0.028151
3,0.8,0.977401,0.919925,0.02127
4,0.9,0.985664,0.881764,0.012825
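
The test-set table can guide the choice of `prob_threshold` for the prediction step. As a sketch, the snippet below picks the lowest threshold whose test-set precision reaches a target. The rows are copied from the test-set table above, while the 0.97 target and the helper name `lowest_threshold_for_precision` are arbitrary choices for illustration.

```python
# (threshold, precision, recall) rows copied from the test-set table above.
TEST_METRICS = [
    (0.5, 0.960871, 0.952455),
    (0.6, 0.965837, 0.946200),
    (0.7, 0.970770, 0.934939),
    (0.8, 0.977401, 0.919925),
    (0.9, 0.985664, 0.881764),
]


def lowest_threshold_for_precision(rows, target_precision):
    """Return the lowest threshold meeting the precision target, or None."""
    for threshold, precision, _recall in sorted(rows):
        if precision >= target_precision:
            return threshold
    return None


print(lowest_threshold_for_precision(TEST_METRICS, 0.97))
```

With a 0.97 target this selects 0.7, the value passed to `predict_from_proteins` in this notebook. A lower threshold would keep more recall at the cost of precision.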


#### Predict HLA-I peptides from fasta

In [13]:
hla_df = model.predict_from_proteins(prob_threshold=0.7)
hla_df

100%|██████████| 72/72 [55:44<00:00, 46.45s/it]


Unnamed: 0,start_pos,end_pos,nAA,HLA_prob_pred,sequence
0,3217601,3217609,8,0.887262,KYSTDVKL
1,3217603,3217611,8,0.961909,STDVKLSL
2,3217498,3217506,8,0.947899,ADSVANKL
3,9414718,9414726,8,0.991264,FPFLFQHI
4,3217523,3217531,8,0.915536,LGGLVHGK
...,...,...,...,...,...
2079182,8242400,8242414,14,0.920233,TDQVTTSDVISKKE
2079183,5161777,5161791,14,0.720403,FLFDFQKTGPPLVG
2079184,2360761,2360775,14,0.988071,RPMYAHHISSKYDE
2079185,2360760,2360774,14,0.809408,SRPMYAHHISSKYD


In [14]:
hla_df[['sequence','HLA_prob_pred']].to_csv('Predicted_HLA.tsv',index=False, sep="\t")

To download `Predicted_HLA.tsv` in Colab, click the `Files` icon (folder logo) in the left panel, then right-click the file and select `Download`.

In [15]:
#@title Download Predicted_HLA.tsv
from google.colab import files
files.download('Predicted_HLA.tsv')


In [16]:
#@title Download transfer learning model
from google.colab import files
model.save('transfer_HLA.pt')
files.download('transfer_HLA.pt')
