# Preprocessing Code and Guidelines for HOGVAX

This Jupyter notebook provides code and guidelines for parts of the preprocessing needed to run HOGVAX on your own data. As an example, we use some of the data from the SARS-CoV-2 study by Liu et al., 2020, *Cell Systems 11, 131–144*. Feel free to use and modify the provided code for your own purposes. If you would like to run HOGVAX for the case study by Liu et al., download the data using the provided [download_data.sh](../download_data.sh) file.

### Creating epitope candidates and removing cleavage regions

If a protein sequence contains cleavage sites, its fasta entry must be split into separate fasta entries, one per cleavage product. This guarantees that we do not create peptides that span a cleavage site and would therefore not occur naturally. Starting from the protein fastas of the SARS-CoV-2 proteins, we cut the sequences into peptides using sliding windows of length 8 to 10 for MHC class I and 13 to 25 for MHC class II. The code below generates all lengths from 8 to 25; lengths 11 and 12 can be removed in a later step. We store the peptides in a text file and additionally create a csv file with further information about the peptides.
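
The two steps described above can be illustrated on a toy sequence. This is a minimal sketch, not the notebook's actual cell: the helper names `split_at_cleavage` and `sliding_windows`, the toy protein, and the short window lengths 3 to 4 are assumptions chosen for readability. Coordinates are taken to be zero-based and inclusive, matching the slicing `record.seq[entry.Start:entry.End+1]` used below.

```python
def split_at_cleavage(seq, regions):
    """Cut `seq` into sub-sequences given inclusive, zero-based (start, end) pairs."""
    return [seq[start:end + 1] for start, end in regions]


def sliding_windows(seq, min_len, max_len):
    """Return (peptide, start, length) for every window of min_len..max_len residues."""
    windows = []
    for length in range(min_len, max_len + 1):
        for start in range(len(seq) - length + 1):
            windows.append((seq[start:start + length], start, length))
    return windows


# Hypothetical toy protein with a cleavage site between index 5 and index 6.
toy_protein = "MYSFVSEETGTL"
parts = split_at_cleavage(toy_protein, [(0, 5), (6, 11)])

# Windows are generated per cleavage product, so none of them spans the cut.
toy_peptides = [w for part in parts for w in sliding_windows(part, 3, 4)]
```

Because the windows are computed on each cleavage product separately, a peptide such as `SEE`, which would straddle the cut in the uncleaved sequence, is never generated.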


In [319]:
import copy
import numpy as np
import pandas as pd
from Bio import SeqIO
from collections import defaultdict

In [320]:
# create a simple csv file with information of cleavage positions, if necessary
# format: Start,End,Sub_Protein (new name for new fasta file),Protein (record.id as in fasta file)
csv_file = 'test/ProteinCleavageSites.csv'
protein_file = 'test/test.fasta'
cleaved_file = 'test/cleaved_test.fasta'

df_cleavage = pd.read_csv(csv_file)

new_records = []
with open(protein_file, 'r') as file:
    for record in SeqIO.parse(file, 'fasta'):
        if (df_cleavage['Protein'] == record.id).any():
            for i, entry in df_cleavage[df_cleavage['Protein'] == record.id].iterrows():
                new_record = copy.deepcopy(record)
                new_record.id = entry.Sub_Protein
                new_record.seq = record.seq[entry.Start:entry.End+1]
                new_records.append(new_record)
        else:
            new_records.append(record)

SeqIO.write(new_records, cleaved_file, 'fasta')

28

In [321]:
# specify input and output files; if you split the input sequences at cleavage sites, make sure to use the modified fasta file as input
input_fasta = 'test/cleaved_test.fasta'
output_peptides = 'test/peptides.pep'
output_csv = 'test/epitope_features.csv'

# give range of peptide length as min and max value
peptide_range = [8, 25]

peptides = []
proteins = []
lengths = []
indices = []
with open(input_fasta, 'r') as file:
    for record in SeqIO.parse(file, 'fasta'):
        protein_seq = str(record.seq).strip('*')
        protein = record.id
        for r in range(peptide_range[0], peptide_range[1]+1):
            curr_peptides = [protein_seq[i:i+r] for i in range(len(protein_seq)-r+1)]
            proteins += [protein] * len(curr_peptides)
            lengths += [r] * len(curr_peptides)
            indices += list(range(len(protein_seq)-r+1))
            peptides += curr_peptides

with open(output_peptides, 'w') as file:
    file.write('\n'.join(peptides))

df_all_features = pd.DataFrame(list(zip(peptides, proteins, indices, lengths)), columns=['Peptides', 'Protein', 'Index', 'Length'])
df_all_features

Unnamed: 0,Peptides,Protein,Index,Length
0,MYSFVSEE,E,0,8
1,YSFVSEET,E,1,8
2,SFVSEETG,E,2,8
3,FVSEETGT,E,3,8
4,VSEETGTL,E,4,8
...,...,...,...,...
153106,FQLTPIAVQMTKLATTEELPDEFVV,ORF9b,68,25
153107,QLTPIAVQMTKLATTEELPDEFVVV,ORF9b,69,25
153108,LTPIAVQMTKLATTEELPDEFVVVT,ORF9b,70,25
153109,TPIAVQMTKLATTEELPDEFVVVTV,ORF9b,71,25


In [322]:
df_all_features.to_csv(output_csv)

### Exclude self-peptides

To exclude self-peptides from the input peptides, you must identify the self-peptides and provide them in a text file with one peptide per row.
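
How self-peptides are identified is left to you. One possible approach, sketched here under the assumption that a self-peptide is any candidate that occurs verbatim in a host protein sequence, is to collect all host k-mers of the relevant lengths and intersect them with the candidates. The function names and the toy host sequences are hypothetical; in practice the host sequences would come from a proteome fasta of your choice.

```python
def kmers(seq, k):
    """Return the set of all substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def find_self_peptides(candidates, host_sequences):
    """Return the candidates that occur verbatim in any host sequence."""
    lengths = {len(p) for p in candidates}
    host_kmers = set()
    for seq in host_sequences:
        for k in lengths:
            host_kmers |= kmers(seq, k)
    return sorted(p for p in set(candidates) if p in host_kmers)


# Hypothetical toy data, for illustration only.
host = ["MKTLLVAAGSTV", "AAGSTVPLQ"]
candidates = ["LLVAAGST", "MYSFVSEE", "GSTVPLQ"]
self_peps = find_self_peptides(candidates, host)
```

The resulting list can be written to a text file with one peptide per row, which is the format the next cell expects.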

In [323]:
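self_peptides = 'test/self_pept.pep'
# read one self-peptide per line, ignoring empty lines
with open(self_peptides, 'r') as file:
    self_peps = {line.strip() for line in file if line.strip()}
# drop all candidate peptides that are listed as self-peptides
df_all_features = df_all_features[~df_all_features['Peptides'].isin(self_peps)]
# display only; df_all_features itself keeps its integer index
df_all_features.set_index('Peptides')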

Peptides,Protein,Index,Length
MYSFVSEE,E,0,8
YSFVSEET,E,1,8
SFVSEETG,E,2,8
FVSEETGT,E,3,8
VSEETGTL,E,4,8
...,...,...,...
FQLTPIAVQMTKLATTEELPDEFVV,ORF9b,68,25
QLTPIAVQMTKLATTEELPDEFVVV,ORF9b,69,25
LTPIAVQMTKLATTEELPDEFVVVT,ORF9b,70,25
TPIAVQMTKLATTEELPDEFVVVTV,ORF9b,71,25


In [324]:
df_all_features.to_csv(output_csv)

### Identifying mutation probabilities of peptides

This step is based on the work by Liu et al. that you can find [here](https://github.com/gifford-lab/optivax/tree/master/covid_mutation_analysis_and_nextstrain_build). To identify peptides that are mutated in different viral strains, you need *a)* the sequences of the different strains, and *b)* alignments of the different strains to a reference sequence. For SARS-CoV-2, these data were already provided by Liu et al. To get the alignments, they used the [nextstrain pipeline](https://docs.nextstrain.org/projects/ncov/en/latest/index.html). For further information on how to compute the alignments, follow the tutorial on the nextstrain website and check out the [README.md](https://github.com/gifford-lab/optivax/blob/master/covid_mutation_analysis_and_nextstrain_build/README.md) by Liu et al.

The following code is based on the code provided by Liu et al. and adds the mutation probability of each of the previously computed peptides to the csv file.
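
To make the two per-window statistics concrete, here is a small sketch on a toy alignment. It mirrors the formulas used in `computeWindows` below (Shannon entropy over the observed window variants, and the fraction of non-reference sequences that differ from the reference window), but the function name `window_stats` and the four toy sequences are assumptions for illustration.

```python
import math
from collections import Counter


def window_stats(aligned_seqs, ref_index, start, window):
    """Return (reference window, entropy, fraction mutated) for one window."""
    variants = [s[start:start + window] for s in aligned_seqs]
    counts = Counter(variants)
    total = len(variants)
    # Shannon entropy (base 2) of the variant distribution in this window
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    ref_variant = variants[ref_index]
    # the reference itself is excluded from numerator and denominator
    perc_mutated = 1 - (counts[ref_variant] - 1) / (total - 1)
    return ref_variant, entropy, perc_mutated


# Hypothetical toy alignment: the reference (index 0) plus three other strains,
# one of which carries a substitution.
toy_alignment = ["MYSFVSEE", "MYSFVSEE", "MYSFVSEE", "MYSLVSEE"]
ref_variant, entropy, perc_mutated = window_stats(toy_alignment, 0, 0, 8)
```

Here one of the three non-reference strains differs from the reference window, so the fraction mutated is 1/3, and the entropy of the 3:1 variant split is about 0.811 bits.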

In [325]:
alignment_files = !ls test/aligned_*.fasta
alignment_files

['test/aligned_protein_E.fasta']

In [326]:
import time 

def computeWindows(aa_file):
    print('computeWindows!!!')
    print('aa_file = ', aa_file) ## Alex
    
    def entropy_calc(x):
        # where x is a list of numbers. 
        summ = 0
        nc = np.sum(x)
        for e in x: 
            p = e/nc
            summ += p*np.log2(p)
        return -summ
    
    window_sizes = list(range(8,26)) 
    protein_res = []
    protein = aa_file.split('_')[-1].split('.')[0]
    print('protein:', protein)
    with open(aa_file, "rt") as handle:
        records = list(SeqIO.parse(handle, "fasta"))
    
    
    # getting rid of the node sequences!
    no_nodes = []
    for r in records: 
        if 'NODE_' not in r.id:
            no_nodes.append(r)
    records = no_nodes
    print('size of records', len(records))
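    # getting the reference sequence
    # alternative reference ids: 'Wuhan/WH01/2019', 'Wuhan/IPBCAMS-WH-01/2019'
    ref_seq_ind = None
    for ind, r in enumerate(records): 
        if r.id == 'Wuhan-Hu-1/2019':
            ref_seq_ind = ind
    if ref_seq_ind is None:
        raise ValueError('reference sequence Wuhan-Hu-1/2019 not found in ' + aa_file)
            
    ref_seq = str(records[ref_seq_ind].seq)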
    
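    # character matrix: one row per sequence, one column per alignment position
    seqs = np.array([list(str(r.seq)) for r in records])
    
    if seqs[ref_seq_ind, -1]=='*': # ignore the stop codon at the end. 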
        print('removing the stop codon at the end', protein)
        ref_seq = ref_seq[:-1]
        seqs = seqs[:, :-1]
    
    for window in window_sizes:
        print('window size', window)
        
        window_res = []
        
        # get epitope based column slices. gives a list of epi columns. 
        epi_columns = [seqs[:,i:i+window] for i in range(seqs.shape[1]-window+1)]
    
        for col_ind, col in enumerate(epi_columns):
            #first need to convert each of the columns into strings:
            col = np.asarray([''.join(col[i,:]) for i in range(col.shape[0])])
            unique, counts = np.unique(col, return_counts=True) 
            ent = entropy_calc(counts)
            
            # useful for percentages
            ref_epitope = col[ref_seq_ind]
            count_dict = dict(zip(unique, counts))
            # -1 for no self count. percentage mutated. 
            perc = 1 - ((count_dict[ref_epitope]-1)/(np.sum(counts)-1))
            
            # start pos, window size, entropy
            window_res.append([protein, ref_epitope, col_ind, window, ent, perc])
            
        protein_res += window_res
        
    df = pd.DataFrame(protein_res)
    return df
    
    
ncores = 1

start_time = time.time()

# run sequentially: compute the window statistics for each alignment file
print('Starting pooling on %d cores' % ncores)
df_list = list()
for file in alignment_files:
    df_list.append(computeWindows(file))

Starting pooling on 1 cores
computeWindows!!!
aa_file =  test/aligned_protein_E.fasta
protein: E
size of records 12789
removing the stop codon at the end E
window size 8
window size 9
window size 10
window size 11
window size 12
window size 13
window size 14
window size 15
window size 16
window size 17
window size 18
window size 19
window size 20
window size 21
window size 22
window size 23
window size 24
window size 25


In [327]:
df = pd.concat(df_list)
df.columns = ['protein', 'epitope', 'start_pos', 'epi_len', 'entropy', 'perc_mutated']

# only keep epitopes that are in the candidate list, i.e., that do not cover a cleavage region and are no self-peptides
df = df[df.epitope.isin(df_all_features.Peptides)]
df.to_csv('test/epitope_features.csv')
df.shape

(1070, 6)

### Add glycosylation probabilities

For glycosylation probability predictions, Liu et al. used the NetNGlyc N-glycosylation prediction server ([Gupta et al., 2004](http://www.cbs.dtu.dk/services/NetNGlyc/)) and validated the predictions for the spike protein using data from other publications. For details, check out [Liu et al., 2020](https://doi.org/10.1016/j.cels.2020.06.009). The following code is again taken from Liu et al. and modified to give you an idea of how to build the epitope feature table, which is useful for filtering the peptides.
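
The core of the annotation is a coordinate check: peptide start positions are zero-based, whereas the predicted site positions are one-based. The sketch below isolates that check in a hypothetical helper `covers_glyco_site`; the example values are the two predicted sites of the E protein (positions 48 and 66) and peptide coordinates that appear in the tables of this notebook.

```python
def covers_glyco_site(pep_start, pep_len, site_position):
    """Check whether the first residue of a predicted site lies inside a peptide.

    pep_start is zero-based; site_position is the one-based position of the
    first residue of the predicted motif.
    """
    site_start = site_position - 1  # convert to zero-based
    pep_end = pep_start + pep_len - 1  # inclusive end of the peptide
    return pep_start <= site_start <= pep_end


e_sites = [48, 66]  # one-based positions predicted for protein E

# peptide starting at index 46 with length 25 covers the site at position 48
flag_46 = any(covers_glyco_site(46, 25, s) for s in e_sites)
# peptide starting at index 48 misses position 48 but covers position 66
flag_48 = any(covers_glyco_site(48, 25, s) for s in e_sites)
# the first 8-mer of the protein covers neither site
flag_0 = any(covers_glyco_site(0, 8, s) for s in e_sites)
```

This agrees with the final feature table, where the 25-mers starting at indices 46 and 48 are flagged with 1.0 and the 8-mer at index 0 with 0.0.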

In [328]:
df_glyco = pd.read_csv('test/glycosolation_predictions.txt', sep=' ', header=None)
df_glyco.columns =['protein', 'position', 'seq', 'prob_of_glyco']
df_glyco.head()

Unnamed: 0,protein,position,seq,prob_of_glyco
0,Wuhan_IPBCAMS-WH-01_2019_E,48,NVSL,0.6507
1,Wuhan_IPBCAMS-WH-01_2019_E,66,NSSR,0.6339
2,Wuhan_IPBCAMS-WH-01_2019_M,5,NGTI,0.7577
3,Wuhan_IPBCAMS-WH-01_2019_N,47,NNTA,0.6798
4,Wuhan_IPBCAMS-WH-01_2019_N,77,NSSP,0.2149


In [329]:
df_glyco['protein'] = df_glyco.protein.apply(lambda x:x.split('_')[-1])
df_glyco

Unnamed: 0,protein,position,seq,prob_of_glyco
0,E,48,NVSL,0.6507
1,E,66,NSSR,0.6339
2,M,5,NGTI,0.7577
3,N,47,NNTA,0.6798
4,N,77,NSSP,0.2149
...,...,...,...,...
72,S,1098,NGTH,0.5496
73,S,1134,NNTV,0.5800
74,S,1158,NHTS,0.3730
75,S,1173,NASV,0.3998


In [330]:
df['glyco_probs'] = np.empty((len(df), 0)).tolist()
for i in range(len(df_glyco)):
    protein_mask = df.protein==df_glyco.iloc[i].protein # select relevant protein
    seq_start = df['start_pos'] # zero based
    seq_end = df['start_pos']+(df['epi_len']-1)
    glyco_start = df_glyco.iloc[i].position - 1 # one based, thus -1
    glyco_end = glyco_start+(len(df_glyco.iloc[i].seq)-1) # zero based, inclusive; only needed for the alternative below
    # a peptide counts as glycosylated if the first residue of the predicted motif lies inside it
    in_region = np.logical_and(glyco_start >= seq_start,glyco_start <= seq_end)
    # alternative, if an overlap with either end of the 4-residue motif should count:
    #     front_in_region = np.logical_and(glyco_start >= seq_start,glyco_start <= seq_end)
    #     end_in_region = np.logical_and(glyco_end >= seq_start,glyco_end <= seq_end)
    #     in_region = np.logical_or(front_in_region, end_in_region)
    in_region_and_protein = np.logical_and(protein_mask,in_region)
    
    df.loc[in_region_and_protein,'glyco_probs'] = df[in_region_and_protein]['glyco_probs'].apply(lambda x: x + [df_glyco.iloc[i].prob_of_glyco])
    #apply protein mask and then epitope mask
df.shape

(1070, 7)

In [331]:
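# collapse the list of probabilities into a binary flag: 1.0 if the peptide covers at least one predicted site, else 0.0
df['glyco_probs'] = df.glyco_probs.apply(lambda x: 1.0 if len(x)>0 else 0.0)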

In [332]:
# optionally remove epitopes of specific lengths
df = df[np.logical_and(df.epi_len!=11, df.epi_len!=12)]
df.shape

(941, 7)

In [333]:
df = df.set_index('epitope')
df

epitope,protein,start_pos,epi_len,entropy,perc_mutated,glyco_probs
MYSFVSEE,E,0,8,0.035578,0.002893,0.0
YSFVSEET,E,1,8,0.042296,0.003441,0.0
SFVSEETG,E,2,8,0.042296,0.003441,0.0
FVSEETGT,E,3,8,0.041116,0.003363,0.0
VSEETGTL,E,4,8,0.038914,0.003206,0.0
...,...,...,...,...,...,...
VNVSLVKPSFYVYSRVKNLNSSRVP,E,46,25,0.174796,0.017907,1.0
NVSLVKPSFYVYSRVKNLNSSRVPD,E,47,25,0.175817,0.017986,1.0
VSLVKPSFYVYSRVKNLNSSRVPDL,E,48,25,0.188548,0.019080,1.0
SLVKPSFYVYSRVKNLNSSRVPDLL,E,49,25,0.186565,0.018924,1.0


In [334]:
df.to_csv('test/epitope_features.csv')

### Filter epitopes and create input file

Given the epitope feature file, we can use it to filter the peptides, e.g., exclude glycosylated, likely mutated, and self-peptides. Peptides of cleavage regions were already filtered out by splitting the sequences at the cleavage positions and thus do not exist in our input data.
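
The filtering boils down to a boolean mask over the feature table. The following sketch shows the idea on a toy table; the helper `select_peptides`, the placeholder peptide strings, and all feature values are hypothetical, and only the column names follow the feature table built above.

```python
import pandas as pd

# Hypothetical toy feature table, indexed by epitope as in the notebook.
toy = pd.DataFrame(
    {
        "epi_len": [8, 9, 13, 25],
        "perc_mutated": [0.00005, 0.002, 0.0004, 0.0002],
        "glyco_probs": [0.0, 0.0, 0.0, 1.0],
    },
    index=pd.Index(["A" * 8, "C" * 9, "D" * 13, "E" * 25], name="epitope"),
)


def select_peptides(df, min_len, max_len, max_mutation):
    """Keep non-glycosylated peptides in the length range below the mutation threshold."""
    mask = (
        (df["glyco_probs"] != 1.0)
        & (df["perc_mutated"] < max_mutation)
        & df["epi_len"].between(min_len, max_len)  # inclusive on both ends
    )
    return list(df[mask].index)


mhc1 = select_peptides(toy, 8, 10, 0.001)
mhc2 = select_peptides(toy, 13, 25, 0.001)
```

In this toy table the 9-mer is dropped because of its mutation probability and the 25-mer because it is flagged as glycosylated, so each class keeps exactly one peptide.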

In [351]:
# create lists of input peptides that exclude glycosylated peptides and peptides whose mutation probability reaches a given threshold (here 0.0001 for MHC class I and 0.001 for MHC class II)
# self-peptides and peptides covering cleavage sites were already removed above
# MHC I length < 11
peptides_mhcI = list(df[(df['glyco_probs'] != 1.0) & (df['perc_mutated'] < 0.0001) & (df['epi_len'] < 11)].index)
peptides_mhcI

# MHC II length range 13-25
peptides_mhcII = list(df[(df['glyco_probs'] != 1.0) & (df['perc_mutated'] < 0.001) & (df['epi_len'].isin(range(13,26)))].index)
peptides_mhcII

['VFLLVTLAILTAL', 'LLVTLAILTALRL', 'LVTLAILTALRLC', 'LLVTLAILTALRLC']

In [352]:
peptides_file = 'test/filtered_peptides_mhcI.pep'
with open(peptides_file, 'w') as pep_file:
    pep_file.write('\n'.join(peptides_mhcI))
    
peptidesII_file = 'test/filtered_peptides_mhcII.pep'
with open(peptidesII_file, 'w') as pep2_file:
    pep2_file.write('\n'.join(peptides_mhcII))

### Format netMHCpan predictions for HOGVAX

Make sure that netMHCpan and netMHCIIpan are installed. Based on the code by Liu et al., the following cells run netMHCpan and netMHCIIpan for each MHC allele listed in the MHC class I and class II allele files, using the peptide files created above. The binding affinity predictions are then converted to the format used by HOGVAX.
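
The pivot tables created below store the normalized binding affinity score between 0 and 1 rather than the affinity in nM. The relation between the two, as used in the `BA_nM` column below, is `affinity = 50000 ** (1 - score)`. A minimal sketch of this conversion and its inverse follows; the function names are assumptions for illustration and are not part of netMHCpan or HOGVAX.

```python
import math


def score_to_nm(ba_score, max_affinity_nm=50000.0):
    """Convert a normalized binding affinity score to an affinity in nM."""
    return max_affinity_nm ** (1.0 - ba_score)


def nm_to_score(affinity_nm, max_affinity_nm=50000.0):
    """Inverse conversion: affinity in nM back to the normalized score."""
    return 1.0 - math.log(affinity_nm) / math.log(max_affinity_nm)


# Score_BA of the first peptide in the netMHCIIpan output shown further below
affinity = score_to_nm(0.318113)
```

For the score 0.318113 this gives roughly 1600 nM, in line with the `Affinity(nM)` value of 1600.17 reported for the same peptide in the netMHCIIpan output further below.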

In [353]:
# Load final set of HLA alleles.
hla_alleles = pd.read_csv('test/MHC1_allele_mary_cleaned.txt', names=['allele'])
hla_alleles

Unnamed: 0,allele
0,HLA-B44:04
1,HLA-B44:05
2,HLA-B44:07
3,HLA-A30:10
4,HLA-B44:02
...,...
225,HLA-B55:02
226,HLA-B67:01
227,HLA-A24:10
228,HLA-B15:32


In [354]:
# for MHC class I predictions
for allele in hla_alleles['allele']:
    outfile = allele.replace(':', '') + '_preds.xls' 
    ! ~/netMHCpan-4.1/netMHCpan -BA -p {peptides_file} -a {allele} -xls -xlsfile test/{outfile}

# /Users/sara/netMHCpan-4.1/Darwin_arm64/bin/netMHCpan -BA -p test/filtered_peptides_mhcI.pep -a HLA-B44:04 -xls -xlsfile test/HLA-B4404_preds.xls
# Wed Oct 11 17:01:50 2023
# User: sara
# PWD : /Users/sara/Documents/VaccinesProject/ivp/Code/HOGVAX
# Host: Darwin bison-skater.local 22.6.0 x86_64
# -BA      1                    Include Binding affinity prediction
# -p       1                    Use peptide input
# -a       HLA-B44:04           MHC allele
# -xls     1                    Save output to xls file
# -xlsfile test/HLA-B4404_preds.xls Filename for xls dump
# Command line parameters set to:
#	[-rdir filename]     /Users/sara/netMHCpan-4.1/Darwin_arm64 Home directory for NetMHpan
#	[-syn filename]      /Users/sara/netMHCpan-4.1/Darwin_arm64/data/synlist.bin Synaps file
#	[-v]                 0                    Verbose mode
#	[-dirty]             0                    Dirty mode, leave tmp dir+files
#	[-tdir filename]     /var/folders/jz/006cgcp100n00j3zvx6_0wvm00

In [364]:
dfs = []
for allele in hla_alleles['allele']:
    df = pd.read_csv(
        'test/' + allele.replace(':', '') + '_preds.xls',
        delimiter='\t',
        skiprows=[0],
    )
    df['Allele'] = allele
    df = df.drop(columns=['Pos', 'ID', 'core', 'icore', 'Ave', 'NB'])
    dfs.append(df)

netmhc41_data = pd.concat(dfs)
netmhc41_data['sequence_length'] = [len(x) for x in netmhc41_data['Peptide'].values]
netmhc41_data['BA_nM'] = 50000 ** (1 - netmhc41_data['BA-score'])
netmhc41_data['Locus'] = [x[:5] for x in netmhc41_data['Allele'].values]

data_pivot = netmhc41_data.pivot_table(
    index='Peptide',
    columns=['Locus', 'Allele'],
    values='BA-score',
)

data_pivot

Locus,HLA-A,HLA-A,HLA-A,HLA-A,HLA-A,HLA-A,HLA-A,HLA-A,HLA-A,HLA-A,...,HLA-C,HLA-C,HLA-C,HLA-C,HLA-C,HLA-C,HLA-C,HLA-C,HLA-C,HLA-C
Allele,HLA-A01:01,HLA-A01:02,HLA-A01:03,HLA-A01:09,HLA-A01:23,HLA-A02:01,HLA-A02:02,HLA-A02:03,HLA-A02:04,HLA-A02:05,...,HLA-C17:01,HLA-C17:02,HLA-C17:03,HLA-C17:04,HLA-C17:05,HLA-C17:06,HLA-C17:07,HLA-C18:01,HLA-C18:02,HLA-C18:03
Peptide,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2
GTLIVNSV,0.0335,0.0492,0.03,0.0335,0.0298,0.1008,0.1309,0.1652,0.0712,0.1601,...,0.0624,0.0624,0.0624,0.0624,0.0624,0.0624,0.036,0.0266,0.0266,0.0267
GTLIVNSVL,0.0527,0.0788,0.0521,0.0527,0.0473,0.1899,0.2556,0.1942,0.1304,0.29,...,0.2784,0.2784,0.2784,0.2784,0.2784,0.2784,0.2048,0.0549,0.0549,0.0489
TLIVNSVL,0.0335,0.0483,0.0318,0.0335,0.0322,0.1666,0.2616,0.2559,0.1078,0.1639,...,0.1442,0.1442,0.1442,0.1442,0.1442,0.1442,0.1132,0.0586,0.0586,0.0564


In [365]:
data_pivot.to_pickle('test/netmhcpan_pred.pkl.gz', protocol=2)

In [357]:
# for MHC class II
hla2_alleles = pd.read_csv('test/MHC2_allele_marry.txt', names=['allele'])
for allele in hla2_alleles['allele']:
    outfile = allele.replace(':', '') + '_preds.xls' 
    ! ~/netMHCIIpan-4.1/netMHCIIpan -inptype 1 -f {peptidesII_file} -a {allele} -BA -xls -xlsfile test/{outfile}

# NetMHCIIpan version 4.1

# Input is in PEPTIDE format

# Prediction Mode: EL+BA

# Threshold for Strong binding peptides (%Rank)	1%
# Threshold for Weak binding peptides (%Rank)	5%

# Allele: HLA-DPA10301-DPB11301
--------------------------------------------------------------------------------------------------------------------------------------------
 Pos                     MHC              Peptide   Of        Core  Core_Rel        Identity      Score_EL %Rank_EL Exp_Bind      Score_BA  Affinity(nM) %Rank_BA  BindLevel
--------------------------------------------------------------------------------------------------------------------------------------------
   1   HLA-DPA10301-DPB11301        VFLLVTLAILTAL    3   LVTLAILTA     0.287        Sequence      0.017498    88.34       NA      0.318113       1600.17    26.37       
   2   HLA-DPA10301-DPB11301        LLVTLAILTALRL    4   LAILTALRL     0.780        Sequence      0.040657    63.76       NA      0.357508       1

In [367]:
dfs = []
for allele in hla2_alleles['allele']:
    try:
        df = pd.read_csv(
            'test/' + allele.replace(':', '') + '_preds.xls',
            delimiter='\t',
            skiprows=[0],
        )
    except Exception:  # skip alleles for which no prediction file could be read
        continue
    df['Allele'] = allele
    df = df.drop(columns=['Pos', 'ID', 'Ave', 'NB'])
    dfs.append(df)

netmhcII_data = pd.concat(dfs)
netmhcII_data['sequence_length'] = [len(x) for x in netmhcII_data['Peptide'].values]
netmhcII_data['Locus'] = [x[:4] if x[:3] == 'DRB' else x[:6] for x in netmhcII_data['Allele'].values]

data_pivot = netmhcII_data.pivot_table(
    index='Peptide',
    columns=['Locus', 'Allele'],
    values='Score_BA',
)

data_pivot

Locus,DRB1,DRB1,DRB1,DRB1,DRB1,DRB1,DRB1,DRB1,DRB1,DRB1,...,HLA-DQ,HLA-DQ,HLA-DQ,HLA-DQ,HLA-DQ,HLA-DQ,HLA-DQ,HLA-DQ,HLA-DQ,HLA-DQ
Allele,DRB1_0101,DRB1_0102,DRB1_0103,DRB1_0301,DRB1_0302,DRB1_0401,DRB1_0402,DRB1_0403,DRB1_0404,DRB1_0405,...,HLA-DQA10505-DQB10302,HLA-DQA10505-DQB10309,HLA-DQA10505-DQB10319,HLA-DQA10505-DQB10402,HLA-DQA10505-DQB10501,HLA-DQA10505-DQB10502,HLA-DQA10506-DQB10303,HLA-DQA10508-DQB10301,HLA-DQA10509-DQB10301,HLA-DQA10601-DQB10301
Peptide,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2
LLVTLAILTALRL,0.495122,0.388826,0.198855,0.168935,0.068692,0.219269,0.179984,0.193876,0.348609,0.285299,...,0.233057,0.177336,0.177336,0.187738,0.289216,0.177012,0.220414,0.177336,0.177336,0.187643
LLVTLAILTALRLC,0.522086,0.422263,0.222608,0.149349,0.076277,0.202343,0.183839,0.189954,0.326209,0.257123,...,0.208564,0.179419,0.179419,0.21291,0.287084,0.17878,0.222672,0.179419,0.179419,0.189232
LVTLAILTALRLC,0.485306,0.386895,0.192564,0.145798,0.062607,0.185582,0.158403,0.166872,0.303599,0.244903,...,0.215915,0.167866,0.167866,0.175624,0.266871,0.163101,0.213389,0.167866,0.167866,0.175945
VFLLVTLAILTAL,0.275935,0.233229,0.156521,0.100431,0.069007,0.162666,0.139793,0.145262,0.233472,0.193843,...,0.179439,0.145602,0.145602,0.18568,0.267939,0.161334,0.18197,0.145602,0.145602,0.154648


In [368]:
data_pivot.to_pickle('test/netmhcIIpan_pred.pkl.gz', protocol=2)