------------------------------------------------------------------------

Copyright 2023 Benjamin Alexander Albert \[Karchin Lab\]

All Rights Reserved

BigMHC Academic License

makedata.ipynb

------------------------------------------------------------------------

#### Dataset Compilation

Create a dir, which we will call `data`
  * Create a dir within `data` called `src`
    * Store all of the below downloadables in the `src` dir
  * All generated datasets will be placed in a dir called `out` within the `data` dir
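
The layout above could be created with a few lines of Python. This is only a minimal sketch: the helper name `make_data_dirs` is hypothetical and not part of the notebook, and the demonstration runs in a throwaway temporary directory rather than in the real `data` dir.

```python
import os
import tempfile


def make_data_dirs(root):
    """Create root/src and root/out (hypothetical helper, not part of the notebook)."""
    src = os.path.join(root, "src")
    out = os.path.join(root, "out")
    os.makedirs(src, exist_ok=True)  # downloaded files go here
    os.makedirs(out, exist_ok=True)  # generated datasets go here
    return src, out


# Demonstrate in a temporary location; in practice root would be the `data` dir.
demo_root = os.path.join(tempfile.mkdtemp(), "data")
src, out = make_data_dirs(demo_root)
```

Because `exist_ok=True` is used, calling the helper again on an existing layout leaves it unchanged.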

Download the **NetMHCpan-4.1** datasets from https://services.healthtech.dtu.dk/suppl/immunology/NAR_NetMHCpan_NetMHCIIpan/
  * Download NetMHCpan_train.tar.gz, extract it, and rename the dir to `netmhcpan_train`
  * Download each of the files in the “MS Ligands” section and concatenate the allele-specific files into a single space-delimited text file called `netmhcpan_test.txt`
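
One way to combine the allele-specific files is sketched below. It assumes each downloaded file already holds space-delimited rows of peptide, target, and allele (the three columns the notebook later reads from `netmhcpan_test.txt`); the helper `concat_files`, the toy file names, and the toy rows are all illustrative and not taken from the real download.

```python
import os
import tempfile


def concat_files(paths, dest):
    """Write the contents of each file in paths to dest, one after another."""
    with open(dest, "w") as fout:
        for path in paths:
            with open(path) as fin:
                text = fin.read()
            fout.write(text)
            # guard against a missing trailing newline merging two records
            if text and not text.endswith("\n"):
                fout.write("\n")


# Toy demonstration: two made-up allele files with "peptide target allele" rows.
tmp = tempfile.mkdtemp()
toy_rows = {"a.txt": "AAAAAAAAA 1 HLA-A02:01", "b.txt": "CCCCCCCCC 0 HLA-B07:02\n"}
paths = []
for name, row in toy_rows.items():
    path = os.path.join(tmp, name)
    with open(path, "w") as f:
        f.write(row)
    paths.append(path)

dest = os.path.join(tmp, "netmhcpan_test.txt")
concat_files(paths, dest)
with open(dest) as f:
    lines = f.read().splitlines()
```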

Download the **MHCflurry-2.0** datasets from https://doi.org/10.17632/zx3kjzc3yx.3
  * Download Data_S2.csv.gz, gunzip it, and rename the csv file to `mhcflurry_s2.csv`
  * Download Data_S3.csv and name it `mhcflurry_s3.csv`
  * Download Data_S5.csv and name it `mhcflurry_s5.csv`
  * Download Data_S6.csv and name it `mhcflurry_s6.csv`

Download the **PRIME-1.0** dataset from https://doi.org/10.1016/j.xcrm.2021.100194
  * Download Table S1 and name it `prime1.xlsx`

Download the **PRIME-2.0** and **MixMHCpred-2.2** datasets from https://doi.org/10.1101/2022.05.23.492800
  * Download supplements/492800_file04.zip and unzip it
    * Rename DatasetS3_PRIME.xlsx to `prime2.xlsx`
    * Rename DatasetS2_ligands.txt to `mixmhcpred.txt`

Download the **MHCnuggets** dataset from https://github.com/KarchinLab/mhcnuggets/tree/master/mhcnuggets/data/production/mhcI
  * Download curated_training_data.csv and rename it to `mhcnuggets.csv`

Download the **TransPHLA** dataset from https://github.com/a96123155/TransPHLA-AOMP/tree/master/Dataset
  * Download train_set.zip, unzip it, and rename the csv file to `transphla.csv`

Download the **HLAthena** dataset from https://doi.org/10.1038/s41587-019-0322-9
  * Download supplementary dataset 1 and rename it to `hlathena.xlsx`

Download the **NEPdb** dataset from http://nep.whu.edu.cn/
  * Search individually for HLA-A/B/C, download the results, and name the files `nepdb_a.csv`, `nepdb_b.csv`, and `nepdb_c.csv`
  * Downloaded files on December 18, 2022

Download the **Neopepsee** dataset from https://doi.org/10.1093/annonc/mdy022
  * Download the supplementary data zip file and extract the contents
  * Rename Supplemental_Table_S5 to `neopepsee.xlsx`

Download the **TESLA** immunogenicity dataset from https://doi.org/10.1016/j.cell.2020.09.015
  * Download Table S4 and name it `tesla_s4.xlsx`
  * Download Table S7 and name it `tesla_s7.xlsx`

Download **IEDB** infectious disease immunogenicity data from http://www.iedb.org/
  * Query with parameters: (epitope) linear peptide, (assay) T cell with both positive and negative outcomes, (MHC restriction) class I, (Host) human, (Disease) infectious
  * After searching, select the “Assays” tab and then click “Export Results”
  * Extract the zipped folder and rename the enclosed file to `iedb.csv`
  * Downloaded files on December 19, 2022

Download the **MANAFEST** dataset from https://doi.org/10.17632/dvmz6pkzvb.1 and name it `manafest.csv`
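
Before running the cells below, it may help to confirm that every renamed download is present in `src`. The following is a minimal sketch: the file and directory names are the ones given in the steps above, while the helper `missing_sources` is hypothetical and the demonstration uses an empty temporary directory.

```python
import os
import tempfile

# File and directory names taken from the download steps above.
EXPECTED_SOURCES = [
    "netmhcpan_train", "netmhcpan_test.txt",
    "mhcflurry_s2.csv", "mhcflurry_s3.csv", "mhcflurry_s5.csv", "mhcflurry_s6.csv",
    "prime1.xlsx", "prime2.xlsx", "mixmhcpred.txt",
    "mhcnuggets.csv", "transphla.csv", "hlathena.xlsx",
    "nepdb_a.csv", "nepdb_b.csv", "nepdb_c.csv",
    "neopepsee.xlsx", "tesla_s4.xlsx", "tesla_s7.xlsx",
    "iedb.csv", "manafest.csv",
]


def missing_sources(srcdir):
    """Return the expected names that are absent from srcdir (hypothetical helper)."""
    return [name for name in EXPECTED_SOURCES
            if not os.path.exists(os.path.join(srcdir, name))]


# Example: an empty directory is missing every expected source.
empty = tempfile.mkdtemp()
missing = missing_sources(empty)
```

An empty return value from `missing_sources` on the real `src` dir would indicate that all names match what the loading cells expect.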

------------------------------------------------------------------------

In [1]:
import pandas as pd
import os

#### The only parameter to set is the path to the data dir

In [2]:
datadir = os.path.join(os.pardir, "data")

srcpath = os.path.join(datadir, "src")
outpath = os.path.join(datadir, "out")


def filterlen(df, minlen=8, maxlen=15):
    """remove peptides shorter than minlen or longer than maxlen residues"""
    return df[df.pep.apply(lambda x: (len(x) >= minlen) and (len(x) <= maxlen))].copy()


def uid(df):
    """set index as mhc_pep"""
    df["uid"] = df.mhc + '_' + df.pep
    return df.set_index("uid", drop=True)


def dedup(df):
    """remove contradictory pos/neg instances and then deduplicate"""
    # count the distinct tgt values recorded for each uid
    ntgt = df.groupby(level=0)["tgt"].transform("nunique").to_numpy()
    # drop every uid whose instances carry contradictory labels
    df = df[ntgt == 1]
    # lastly keep the first row of each remaining uid
    return df[~df.index.duplicated()].sort_index()


def standardizeDF(df):
    """
    Function to be called on every loaded dataset for
    standardizing MHC and peptide values and deduplication.
    Assumes dataframe has columns: mhc, pep, tgt
    """
    df.mhc = df.mhc.apply(str.upper)
    df.pep = df.pep.apply(str.upper)
    df.tgt = df.tgt.astype(int)
    df = dedup(uid(filterlen(df)))
    return df.sort_index()
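
The effect of the standardization steps can be seen on a toy dataframe. The sketch below does not call the functions above; it uses vectorized pandas equivalents of the same steps (uppercase, length filter, `mhc_pep` index, removal of contradictory labels, deduplication), and every row in it is made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "mhc": ["hla-a*02:01", "HLA-A*02:01", "HLA-A*02:01", "HLA-B*07:02", "HLA-B*07:02"],
    "pep": ["siinfekl", "SIINFEKL", "AAAAAAA", "GILGFVFTL", "GILGFVFTL"],
    "tgt": [1, 1, 1, 1, 0],
})

# uppercase MHC and peptide values
toy["mhc"] = toy.mhc.str.upper()
toy["pep"] = toy.pep.str.upper()

# keep peptides of length 8 to 15: this drops the 7-mer AAAAAAA
toy = toy[toy.pep.str.len().between(8, 15)].copy()

# index each row by mhc_pep
toy.index = toy.mhc + "_" + toy.pep

# drop uids whose labels disagree (the GILGFVFTL pair), then keep one row per uid
consistent = (toy.groupby(level=0)["tgt"].transform("nunique") == 1).to_numpy()
result = toy[consistent]
result = result[~result.index.duplicated()]
```

Only `HLA-A*02:01_SIINFEKL` survives: the 7-mer is too short, and the two `GILGFVFTL` rows contradict each other.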

#### Load NetMHCpan-4.1 data

In [3]:
# Load NetMHCpan-4.1 training data into var: netmhcpanTrain
print("loading NetMHCpan-4.1 training data...")
netmhcpanTrain = \
    [pd.read_csv(
        os.path.join(
            srcpath,
            "netmhcpan_train",
            "c00{}_el".format(x)),
        delimiter=' ',
        header=None,
        names=["pep","tgt","mhc"])
    for x in range(5)]
netmhcpanTrain = pd.concat(netmhcpanTrain)
print("removing non-HLA pMHCs...")
netmhcpanTrain = netmhcpanTrain[netmhcpanTrain.mhc.apply(lambda x: x.startswith("HLA"))]
print("standardizing allele names...")
netmhcpanTrain.mhc = netmhcpanTrain.mhc.apply(lambda x: "HLA-{}*{}:{}".format(x[4], x[5:7], x[-2:]))
netmhcpanTrain = netmhcpanTrain[["mhc","pep","tgt"]]
print("standardizing dataframe...")
netmhcpanTrain = standardizeDF(netmhcpanTrain)
print("done")

loading NetMHCpan-4.1 training data...
removing non-HLA pMHCs...
standardizing allele names...
standardizing dataframe...
done
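
The allele renaming in the cell above relies on fixed character positions. As a small illustration, assuming source names of the form `HLA-A02:01` (inferred from the slicing, not checked against the files), the same expression behaves as follows; the function name is hypothetical.

```python
def netmhcpan_to_standard(name):
    """Convert e.g. 'HLA-A02:01' to 'HLA-A*02:01' using the same slicing as the cell above."""
    return "HLA-{}*{}:{}".format(name[4], name[5:7], name[-2:])


converted = netmhcpan_to_standard("HLA-A02:01")
```

Because the slices are positional, a name whose fields are not two digits wide (for example a three-digit protein field) would be truncated rather than rejected.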


In [4]:
# Load NetMHCpan-4.1 testing data into var: netmhcpanTest
print("loading NetMHCpan-4.1 testing data...")
netmhcpanTest = pd.read_csv(
    os.path.join(srcpath, "netmhcpan_test.txt"),
    delimiter=' ',
    header=None,
    names=["pep","tgt","mhc"])
print("standardizing allele names...")
netmhcpanTest.mhc = netmhcpanTest.mhc.apply(lambda x: "HLA-{}*{}:{}".format(x[4], x[5:7], x[-2:]))
netmhcpanTest = netmhcpanTest[["mhc","pep","tgt"]]
print("standardizing dataframe...")
netmhcpanTest = standardizeDF(netmhcpanTest)
print("done")

loading NetMHCpan-4.1 testing data...
standardizing allele names...
standardizing dataframe...
done


#### Load MHCflurry-2.0 data

In [5]:
# Load MHCflurry-2.0 EL training data into var: mhcflurryTrain
print("loading MHCflurry-2.0 training data...")
cols = ["hla", "peptide", "hit"]
print("loading S2...")
s2 = pd.read_csv(
    os.path.join(srcpath, "mhcflurry_s2.csv"),
    usecols=cols)
print("loading S5...")
s5 = pd.read_csv(
    os.path.join(srcpath, "mhcflurry_s5.csv"),
    usecols=cols)
s2 = s2[cols]
s5 = s5[cols]
s2.columns = ["mhc","pep","tgt"]
s5.columns = ["mhc","pep","tgt"]
mhcflurryTrain = pd.concat((s2, s5))
del s2, s5
print("standardizing dataframe...")
mhcflurryTrain = standardizeDF(mhcflurryTrain)
print("done")

loading MHCflurry-2.0 training data...
loading S2...
loading S5...
standardizing dataframe...
done


In [6]:
# Load other MHCflurry-2.0 training data: mhcflurryOther
print("loading other MHCflurry-2.0 data...")
print("loading S3...")
s3 = pd.read_csv(
    os.path.join(srcpath, "mhcflurry_s3.csv"),
    usecols=["allele", "peptide"])
print("loading S5...")
s5 = pd.read_csv(
    os.path.join(srcpath, "mhcflurry_s5.csv"),
    usecols=["hla", "peptide"])
print("loading S6...")
s6 = pd.read_csv(
    os.path.join(srcpath, "mhcflurry_s6.csv"),
    usecols=["hla", "peptide"])
s3 = s3[["allele", "peptide"]]
s5 = s5[["hla",    "peptide"]]
s6 = s6[["hla",    "peptide"]]
s3.columns = ["mhc","pep"]
s5.columns = ["mhc","pep"]
s6.columns = ["mhc","pep"]
print("exploding S6 multiallelic peptides...")
s6.pep = s6.pep.apply(lambda x: x.split(' '))
s6 = s6.explode("pep")
mhcflurryOther = pd.concat((s3,s5,s6))
del s3, s5, s6
print("done")

loading other MHCflurry-2.0 data...
loading S3...
loading S5...
loading S6...
exploding S6 multiallelic peptides...
done
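
The split-and-explode step applied to S6 can be illustrated on a toy frame. The rows below are made up; only the two pandas calls mirror the cell above.

```python
import pandas as pd

toy = pd.DataFrame({
    "mhc": ["HLA-A*02:01", "HLA-B*07:02"],
    "pep": ["AAAAAAAAA CCCCCCCCC", "DDDDDDDDD"],
})

# turn each space-separated string into a list, then emit one row per list element
toy["pep"] = toy.pep.str.split(" ")
exploded = toy.explode("pep")
```

Note that `explode` repeats the original row label for each emitted row, so the resulting index contains duplicates until it is reset or replaced.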


#### Load PRIME-1.0 data

In [7]:
# Load all PRIME-1.0 data into var: prime1
print("loading PRIME neoepitopes...")
cols = ["Allele", "Mutant", "Immunogenicity", "StudyOrigin"]
prime1 = pd.read_excel(
    os.path.join(srcpath, "prime1.xlsx"),
    skiprows=2,
    usecols=cols)
prime1 = prime1[cols]
prime1["rand"] = prime1.StudyOrigin.apply(lambda x: x == "Random")
prime1.drop("StudyOrigin", axis=1, inplace=True)
prime1.columns = ["mhc", "pep", "tgt", "rand"]
prime1.mhc = prime1.mhc.apply(lambda x: "HLA-{}*{}:{}".format(x[0], x[1:3], x[3:]))
print("standardizing dataframe...")
prime1 = standardizeDF(prime1)
print("done")

loading PRIME neoepitopes...
standardizing dataframe...
done


#### Load PRIME-2.0 data

In [8]:
# Load PRIME-2.0 neoepitope data into var: prime2
print("loading PRIME neoepitopes...")
cols = ["Allele", "Mutant", "Immunogenicity","Random"]
prime2 = pd.read_excel(
    os.path.join(srcpath, "prime2.xlsx"),
    skiprows=1,
    usecols=cols)
prime2 = prime2[cols]
prime2.columns = ["mhc", "pep", "tgt", "rand"]
prime2.mhc = prime2.mhc.apply(lambda x: "HLA-{}*{}:{}".format(x[0], x[1:3], x[3:]))
print("standardizing dataframe...")
prime2 = standardizeDF(prime2)
print("done")

loading PRIME neoepitopes...
standardizing dataframe...
done


#### Load MixMHCpred-2.2 data

In [9]:
print("loading MixMHCpred-2.2 data...")
mixmhcpred = pd.read_csv(
    os.path.join(srcpath, "mixmhcpred.txt"),
    delimiter='\t',
    skiprows=1)
mixmhcpred = mixmhcpred[["Allele", "Peptide"]]
mixmhcpred.columns = ["mhc", "pep"]
mixmhcpred.mhc = mixmhcpred.mhc.apply(lambda x: "HLA-" + x[0] + '*' + x[1:3] + ':' + x[-2:])
print("done")

loading MixMHCpred-2.2 data...
done


#### Load MHCnuggets-2.4.0 data

In [10]:
print("reading mhcnuggets...")
cols = ["mhc", "peptide"]
mhcnuggets = pd.read_csv(
    os.path.join(srcpath, "mhcnuggets.csv"),
    usecols=cols)
mhcnuggets = mhcnuggets[["mhc", "peptide"]]
mhcnuggets.columns = ["mhc","pep"]
print("done")

reading mhcnuggets...
done


#### Load TransPHLA data

In [11]:
print("loading TransPHLA data...")
cols = ["HLA","peptide"]
transphla = pd.read_csv(
    os.path.join(srcpath, "transphla.csv"),
    usecols=cols)
transphla = transphla[cols].copy()
transphla.columns = ["mhc", "pep"]
print("done")

loading TransPHLA data...
done


#### Load HLAthena data

In [12]:
print("loading HLAthena...")
hlathenaExcel = pd.ExcelFile(os.path.join(srcpath, "hlathena.xlsx"))
hlathena = list()
for x in hlathenaExcel.sheet_names[1:]:
    hlathena.append(hlathenaExcel.parse(x, usecols=["sequence"]))
    hlathena[-1]["mhc"] = "HLA-" + x[0] + '*' + x[1:3] + ':' + x[-2:]
hlathena = pd.concat(hlathena)
hlathena = hlathena[["mhc", "sequence"]]
hlathena.columns = ["mhc", "pep"]
del hlathenaExcel
print("done")

loading HLAthena...
done


#### Load NEPdb data

In [13]:
# Load NEPdb HLA-A/B/C data into var: nepdb
print("loading NEPdb data...")
cols = ["alleleA", "mut_peptide", "response"]
nepdb = pd.concat(
    [pd.read_csv(
        os.path.join(srcpath, "nepdb_{}.csv".format(x)),
        usecols=cols)
    for x in ['a','b','c']])
nepdb = nepdb[cols]
nepdb.columns = ["mhc", "pep", "tgt"]
nepdb.mhc = nepdb.mhc.apply(lambda x: "HLA-" + x)
nepdb.tgt = nepdb.tgt.apply(lambda x: int(x.lower()=='p'))
print("standardizing dataframe...")
nepdb = standardizeDF(nepdb)
print("done")

loading NEPdb data...
standardizing dataframe...
done


#### Load Neopepsee data

In [14]:
# Load Neopepsee data into var: neopepsee
print("loading Neopepsee data...")
sciencecols = ["IEDB_A_type", "WT_AA", "Answer"]
bloodcols = ["HLA-A_allele", "WT_AA", "Answer"]
science = pd.read_excel(
    os.path.join(srcpath, "neopepsee.xlsx"),
    sheet_name="Science",
    usecols=sciencecols)
blood = pd.read_excel(
    os.path.join(srcpath, "neopepsee.xlsx"),
    sheet_name="Blood",
    usecols=bloodcols)
science = science[sciencecols]
blood = blood[bloodcols]
science.columns = ["mhc","pep","tgt"]
blood.columns = ["mhc","pep","tgt"]
science.tgt = science.tgt.apply(lambda x: int(x))
blood.mhc = blood.mhc.apply(lambda x: "HLA-" + x)
blood.tgt = blood.tgt.apply(lambda x: int(x.strip()[0].lower() == "p"))
neopepsee = pd.concat((science, blood))
del science, blood
print("standardizing dataframe...")
neopepsee = standardizeDF(neopepsee)
print("done")

loading Neopepsee data...
standardizing dataframe...
done


#### Load TESLA data

In [15]:
# Load TESLA S4 and S7 data into var: tesla
print("loading TESLA data...")
cols = ["PMHC", "VALIDATED"]
s4 = pd.read_excel(
    os.path.join(srcpath, "tesla_s4.xlsx"),
    usecols=cols)
s7 = pd.read_excel(
    os.path.join(srcpath, "tesla_s7.xlsx"),
    usecols=cols)
s4.columns = ["pmhc", "tgt"]
s7.columns = ["pmhc", "tgt"]
s4["mhc"] = s4.pmhc.apply(lambda x: "HLA-" + x[:x.index('_')])
s7["mhc"] = s7.pmhc.apply(lambda x: "HLA-" + x[:x.index('_')])
s4["pep"] = s4.pmhc.apply(lambda x: x[x.index('_')+1:])
s7["pep"] = s7.pmhc.apply(lambda x: x[x.index('_')+1:])
s4 = s4[["mhc","pep","tgt"]]
s7 = s7[["mhc","pep","tgt"]]
tesla = pd.concat((s4, s7))
del s4, s7
print("standardizing dataframe...")
tesla = standardizeDF(tesla)
print("done")

loading TESLA data...
standardizing dataframe...
done
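
The `PMHC` column is split at its first underscore into an allele and a peptide. A vectorized sketch of the same split is shown below; the example values assume a format like `A*02:01_PEPTIDE`, which is inferred from the code above rather than from the TESLA tables themselves.

```python
import pandas as pd

toy = pd.DataFrame({"pmhc": ["A*02:01_AAAAAAAAA", "B*07:02_CCCCCCCCC"]})

# split once at the first underscore: column 0 is the allele, column 1 the peptide
parts = toy.pmhc.str.split("_", n=1, expand=True)
toy["mhc"] = "HLA-" + parts[0]
toy["pep"] = parts[1]
```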


#### Load IEDB data

In [16]:
# Load IEDB data into var: iedb
print("loading IEDB data...")
iedb = pd.read_csv(
    os.path.join(srcpath, "iedb.csv"),
    skiprows=1,
    usecols=[11,87,101])
iedb.columns = ["pep","tgt","mhc"]
iedb = iedb[["mhc","pep","tgt"]]
iedb = iedb[iedb.mhc.apply(lambda x: (' ' not in x) and ('-' in x) and ('*' in x) and (':' in x))]
# NetMHCpan does not have a pseudosequence for the below alleles
iedb = iedb[iedb.mhc.apply(lambda x: (x != "HLA-B*44:01"))]
# MixMHCpred cannot make predictions for the below alleles
iedb = iedb[iedb.mhc.apply(lambda x: (x != "HLA-A*02:09"))]
iedb = iedb[iedb.mhc.apply(lambda x: (x != "HLA-B*07:06"))]
iedb = iedb[iedb.mhc.apply(lambda x: (x != "HLA-B*40:10"))]
# HLAthena cannot make predictions for the below alleles
iedb = iedb[iedb.mhc.apply(lambda x: (x != "HLA-A*02:16"))]
iedb.tgt = iedb.tgt.apply(lambda x: int(x.strip()[0].lower() == "p"))
print("standardizing dataframe...")
iedb = standardizeDF(iedb)
print("done")

loading IEDB data...
standardizing dataframe...
done
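
Two small predicates drive the IEDB cell: one keeps only fully resolved allele names, and one maps the assay outcome to a binary target. The sketch below restates them as named functions (the names are hypothetical) and applies them to illustrative strings.

```python
def is_full_allele(name):
    """Same test as the cell above: no spaces, and contains '-', '*' and ':'."""
    return (' ' not in name) and ('-' in name) and ('*' in name) and (':' in name)


def outcome_to_target(outcome):
    """1 if the outcome string starts with 'p' (case-insensitive), else 0."""
    return int(outcome.strip()[0].lower() == "p")


candidates = ["HLA-A*02:01", "HLA-A2", "HLA class I", "HLA-B*07:02"]
kept = [a for a in candidates if is_full_allele(a)]
labels = [outcome_to_target(o) for o in ["Positive", "Positive-Low", "Negative"]]
```

Under this mapping, every outcome beginning with "Positive" is treated as a positive instance regardless of any suffix.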


#### Load MANAFEST data

In [17]:
# Load MANAFEST data into var: manafest
print("loading MANAFEST data...")
manafest = pd.read_csv(os.path.join(srcpath, "manafest.csv"))
manafest = manafest[["mhc","pep","tgt"]]
print("standardizing dataframe...")
manafest = standardizeDF(manafest)
print("done")

loading MANAFEST data...
standardizing dataframe...
done


#### Make training and evaluation datasets

In [23]:
def rmintersection(df, *others):
    for other in others:
        df = df[~df.index.isin(other.index)].copy()
    return df


print("making el_trainval...")
eltrainval = dedup(pd.concat((netmhcpanTrain, mhcflurryTrain))).sort_index()
eltrainval = filterlen(eltrainval, maxlen=14)

print("making im_trainval...")
prime = dedup(pd.concat((prime1, prime2))).sort_index()
imtrainval = prime[prime.rand == 0].copy()
imtrainval = filterlen(imtrainval, minlen=9, maxlen=10)

print("making im_test...")
imtest = dedup(pd.concat((manafest, tesla, nepdb, neopepsee))).sort_index()

print("removing el_test from el_trainval...")
eltrainval = rmintersection(eltrainval, netmhcpanTest)

print("removing all EL data and prime from immuno test data...")
imtest = rmintersection(imtest, eltrainval, netmhcpanTest, prime)
iedbTest = rmintersection(iedb.copy(), eltrainval, netmhcpanTest, prime)

print("removing long peptides from immuno test data...")
# HLAthena can only make predictions on peptides of length [8,11]
imtest = filterlen(imtest, minlen=8, maxlen=11)
iedbTest = filterlen(iedbTest, minlen=8, maxlen=11)

print("removing all other training data from immuno test data...")
otherTrain = uid(pd.concat((mhcflurryOther, mixmhcpred, mhcnuggets, transphla, hlathena)))
imtest = rmintersection(imtest, otherTrain)
iedbTest = rmintersection(iedbTest, otherTrain)

print("splitting el_trainval and im_trainval into training and validation...")

elval = eltrainval.iloc[::10].copy()
eltrain = eltrainval[~eltrainval.index.isin(elval.index)]

imval = imtrainval.iloc[::10].copy()
imtrain = imtrainval[~imtrainval.index.isin(imval.index)]

print("making complete EL dataset...")

el = pd.concat((eltrainval, netmhcpanTest)).sort_index()

print("done")

making el_trainval...
making im_trainval...
making im_test...
removing el_test from el_trainval...
removing all EL data and prime from immuno test data...
removing long peptides from immuno test data...
removing all other training data from immuno test data...
splitting el_trainval and im_trainval into training and validation...
making complete EL dataset...
done
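
The leakage removal and the train/validation split can be checked on toy data. In the sketch below, `rmintersection` is copied from the cell above, while the frames and their `uid` labels are made up; it shows that taking every tenth row yields a validation set of roughly 10% that is disjoint from the training set.

```python
import pandas as pd


def rmintersection(df, *others):
    for other in others:
        df = df[~df.index.isin(other.index)].copy()
    return df


trainval = pd.DataFrame(
    {"tgt": [i % 2 for i in range(25)]},
    index=["uid{:02d}".format(i) for i in range(25)])
test = pd.DataFrame({"tgt": [1, 0]}, index=["uid03", "uid99"])

# remove any instance that also appears in the test set (here: uid03)
trainval = rmintersection(trainval, test)

# every tenth row becomes validation; the rest is training
val = trainval.iloc[::10].copy()
train = trainval[~trainval.index.isin(val.index)]
```

Because the frames are sorted by index before the split, the selection is deterministic for a given input.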


#### Write training and evaluation datasets to output dir

In [24]:
os.makedirs(outpath, exist_ok=True)

print("writing el_train...")
eltrain.to_csv(os.path.join(outpath, "el_train.csv"), index=False)
print("writing el_val...")
elval.to_csv(os.path.join(outpath, "el_val.csv"), index=False)
print("writing el_trainval...")
eltrainval.to_csv(os.path.join(outpath, "el_trainval.csv"), index=False)
print("writing el_test...")
netmhcpanTest.to_csv(os.path.join(outpath, "el_test.csv"), index=False)
print("writing el...")
el.to_csv(os.path.join(outpath, "el.csv"), index=False)

print("writing im_train...")
imtrain.to_csv(os.path.join(outpath, "im_train.csv"), index=False)
print("writing im_val...")
imval.to_csv(os.path.join(outpath, "im_val.csv"), index=False)
print("writing im_trainval...")
imtrainval.to_csv(os.path.join(outpath, "im_trainval.csv"), index=False)
print("writing im_test...")
imtest.to_csv(os.path.join(outpath, "im_test.csv"), index=False)

print("writing iedb...")
iedbTest.to_csv(os.path.join(outpath, "iedb.csv"), index=False)


print("done")

writing el_train...
writing el_val...
writing el_trainval...
writing el_test...
writing el...
writing im_train...
writing im_val...
writing im_trainval...
writing im_test...
writing iedb...
done
