------------------------------------------------------------------------

Copyright 2023 Benjamin Alexander Albert \[Karchin Lab\]

All Rights Reserved

BigMHC Academic License

runothers.ipynb

------------------------------------------------------------------------

#### Run Other Methods for Comparison Against BigMHC

Create a dir, which we will call `third_party`
 * All of the downloads below will be placed in `third_party`
 * All results will be placed in a dir called `out` within `third_party`
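
As a minimal sketch, the layout described above could be created from Python as follows. The helper name `make_layout` and its `root` argument are illustrative assumptions and are not part of the notebook.

```python
from pathlib import Path


def make_layout(root):
    """Create `third_party` and its `out` subdirectory under `root`.

    Hypothetical helper: only the directory names come from the text above.
    """
    third_party = Path(root) / "third_party"
    out = third_party / "out"
    # parents=True also creates third_party; exist_ok makes reruns harmless
    out.mkdir(parents=True, exist_ok=True)
    return third_party, out
```

Calling `make_layout("..")` from the notebook's directory would give the `../third_party` location that the code cells below expect.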

Install **NetMHCpan-4.1** https://services.healthtech.dtu.dk/service.php?NetMHCpan-4.1
  * Download NetMHCpan-4.1b and extract all contents to a dir called netmhcpan
    * the directory structure should be `third_party/netmhcpan/`, where `netmhcpan` contains:
      * Linux_x86_64 dir
      * netMHCpan tcsh script
      * netMHCpan.1 file
      * netMHCpan-4.1.readme
      * test dir
  * Follow the instructions of the netMHCpan-4.1.readme to install and test NetMHCpan-4.1
    * The instructions are summarized below for completeness:
      * `wget https://services.healthtech.dtu.dk/services/NetMHCpan-4.1/data.tar.gz`
      * `tar -xvf data.tar.gz`
      * `rm data.tar.gz`
      * Edit the `netMHCpan` script and replace the default NMHOME var with the full path to the `netmhcpan` dir
      * From within the `netmhcpan/test` dir, run the following:
        * `../netMHCpan -p test.pep > test.pep.myout`
        * `../netMHCpan test.fsa > test.fsa.myout`
        * `../netMHCpan -hlaseq B0702.fsa -p test.pep > test.pep_userMHC.myout`
        * `../netMHCpan -p test.pep -BA -xls -a HLA-A01:01,HLA-A02:01 -xlsfile NetMHCpan_myout.xls`
      * Then diff each of the `.myout` files with their respective `.out` files:
        * `diff test.pep.out test.pep.myout`
        * `diff test.fsa.out test.fsa.myout`
        * `diff test.pep_userMHC.out test.pep_userMHC.myout`
        * `diff NetMHCpan_out.xls NetMHCpan_myout.xls`
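
The four comparisons above could also be scripted. The sketch below uses the file names from the list; the helper `mismatches` is a hypothetical name. It does a plain byte-for-byte comparison and does not reproduce the line-level output of `diff`.

```python
import filecmp
from pathlib import Path

# (reference, regenerated) file pairs, named as in the steps above
PAIRS = [
    ("test.pep.out", "test.pep.myout"),
    ("test.fsa.out", "test.fsa.myout"),
    ("test.pep_userMHC.out", "test.pep_userMHC.myout"),
    ("NetMHCpan_out.xls", "NetMHCpan_myout.xls"),
]


def mismatches(testdir, pairs=PAIRS):
    """Return the pairs that differ or for which either file is missing."""
    testdir = Path(testdir)
    bad = []
    for ref, mine in pairs:
        a, b = testdir / ref, testdir / mine
        if not (a.is_file() and b.is_file()):
            bad.append((ref, mine))
        elif not filecmp.cmp(a, b, shallow=False):  # compare contents, not stat info
            bad.append((ref, mine))
    return bad
```

An empty list from `mismatches("../third_party/netmhcpan/test")` would correspond to all four `diff` commands printing nothing.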

Install **MHCflurry-2.0** https://github.com/openvax/mhcflurry
  * Install TensorFlow 2.2.0 or later: `conda install tensorflow` or `pip install tensorflow`
  * `pip install mhcflurry`
  * `mhcflurry-downloads fetch`

Install **MHCnuggets** https://github.com/KarchinLab/mhcnuggets
  * `git clone https://github.com/KarchinLab/mhcnuggets.git`
  * Refactor all imports `from mhcnuggets.src.X` to `from X`. Within the src dir, run:
    * `sed -i "s/from mhcnuggets.src./from /g" *.py`
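
To illustrate what the `sed` expression does to a single line, here is a small Python equivalent. The sample import line is only an example and is not taken from the MHCnuggets sources; `flatten_import` is a hypothetical helper name.

```python
import re

# Same pattern as the sed expression. The dots are unescaped there too, so
# they match any character, which is harmless because literal dots also match.
PATTERN = re.compile(r"from mhcnuggets.src.")


def flatten_import(line):
    """Rewrite `from mhcnuggets.src.X import ...` to `from X import ...`."""
    return PATTERN.sub("from ", line)


print(flatten_import("from mhcnuggets.src.dataset import Dataset"))
# prints: from dataset import Dataset
```

Lines that do not contain the package prefix pass through unchanged.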

Install **TransPHLA** https://github.com/a96123155/TransPHLA-AOMP
  * `git clone https://github.com/a96123155/TransPHLA-AOMP.git`
  * We need to remove the final softmax activation so that output values are not squashed to 0 or 1
    * In the `model.py` file found in `TransPHLA-AOMP/TransPHLA-AOMP`, within the `eval_step` function, do the following:
      * Replace: `y_prob_val = nn.Softmax(dim = 1)(val_outputs)[:, 1].cpu().detach().numpy()`
      * With: `y_prob_val = val_outputs[:, 1].cpu().detach().numpy()`
  * On line 99 of TransPHLA-AOMP/TransPHLA-AOMP/pHLAIformer.py, there is an erroneous indentation
    * Remove a single tab in front of `log = Logger(errLogPath)`
  * On lines 63-65 of TransPHLA-AOMP/TransPHLA-AOMP/pHLAIformer.py, change the argument type from `bool` to `int`
    * As of November 27, 2022, argparse does not handle `type=bool` usefully: `bool` is applied to the raw string, so any non-empty value (including `False` or `0`) parses as `True`
  * To enable CUDA, set `use_cuda = True` on each of the following lines:
    * line 160 of TransPHLA-AOMP/TransPHLA-AOMP/pHLAIformer.py
    * line 62 of TransPHLA-AOMP/TransPHLA-AOMP/model.py
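
Two of the patches above can be motivated with a short, self-contained sketch. The logit values are made up for illustration, and `--output_heatmap` is borrowed from the command line used later in this notebook as a stand-in, since it is not stated here which arguments are defined on lines 63-65. The helper names are hypothetical.

```python
import argparse
import math


def parse_flag(argtype, value):
    """Parse a single flag with the given argparse `type`."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--output_heatmap", type=argtype)
    return parser.parse_args(["--output_heatmap", value]).output_heatmap


def softmax_positive(neg_logit, pos_logit):
    """Positive-class probability of a two-class softmax (numerically stable)."""
    m = max(neg_logit, pos_logit)
    e_neg = math.exp(neg_logit - m)
    e_pos = math.exp(pos_logit - m)
    return e_pos / (e_neg + e_pos)


# type=bool calls bool("0"), and any non-empty string is truthy
print(parse_flag(bool, "0"))  # True
print(parse_flag(int, "0"))   # 0

# Two different raw outputs become indistinguishable after the softmax
print(softmax_positive(-20.0, 20.0), softmax_positive(-25.0, 25.0))  # 1.0 1.0
```

Once the softmax saturates, confidently scored peptides tie at exactly 1.0 and can no longer be ranked against each other, whereas the raw outputs (20.0 and 25.0 in this example) remain distinct.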

Install **MixMHCpred2.1** and **MixMHCpred2.2** https://github.com/GfellerLab/MixMHCpred/releases
  * Download and extract MixMHCpred v2.1 and v2.2
    * `wget https://github.com/GfellerLab/MixMHCpred/archive/refs/tags/v2.1.tar.gz`
    * `wget https://github.com/GfellerLab/MixMHCpred/archive/refs/tags/v2.2.tar.gz`
    * `tar -xzvf v2.1.tar.gz`
    * `tar -xzvf v2.2.tar.gz`
    * `rm v2.1.tar.gz v2.2.tar.gz`
  * Compile the models
    * `g++ -O3 MixMHCpred-2.1/lib/MixMHCpred.cc -o MixMHCpred-2.1/lib/MixMHCpred.x`
    * `g++ -O3 MixMHCpred-2.2/lib/MixMHCpred.cc -o MixMHCpred-2.2/lib/MixMHCpred.x`
  * Edit the `MixMHCpred` scripts and set the lib_path var to the full path of MixMHCpred-2.x/lib
  * Test the installation from within each of the MixMHCpred-2.x dirs:
    * `./MixMHCpred -i test/test.fa -o test/out.txt -a A0101,A2501,B0801,B1801`
    * `diff test/out_compare.txt test/out.txt`

Install **PRIME-1.0** and **PRIME-2.0** https://github.com/GfellerLab/PRIME/releases
  * Download and extract PRIME v1.0 and v2.0:
    * `wget https://github.com/GfellerLab/PRIME/archive/refs/tags/v1.0.tar.gz`
    * `wget https://github.com/GfellerLab/PRIME/archive/refs/tags/v2.0.tar.gz`
    * `tar -xzvf v1.0.tar.gz`
    * `tar -xzvf v2.0.tar.gz`
    * `rm v1.0.tar.gz v2.0.tar.gz`
  * Compile PRIME-2.0 (version 1.0 does not need compilation)
    * `g++ -O3 PRIME-2.0/lib/PRIME.cc -o PRIME-2.0/lib/PRIME.x`
  * Edit the `PRIME` scripts and set the lib_path var to the full path of PRIME-x.0/lib
  * Test the installation from within each of the PRIME-x.0 dirs
    * Test PRIME-1.0
      * `./PRIME -i test/test.txt -o test/out.txt -a A0201,A0101 -mix MixMHCpred2.1_path`
      * `diff test/out_compare.txt test/out.txt`
    * Test PRIME-2.0
      * `./PRIME -i test/test.txt -o test/out.txt -a A0101,A2501,B0801,B1801 -mix MixMHCpred2.2_path`
      * `diff test/out_compare.txt test/out.txt`

Install **HLAthena** http://hlathena.tools/
  * Install Docker https://docs.docker.com/
    * Debian installation instructions: https://docs.docker.com/desktop/install/debian/
  * If on Linux, add your user to the docker group
    * `sudo usermod -aG docker $USER`
  * Pull the HLAthena docker image:
    * `docker pull ssarkizova/hlathena-external`
  * To kill all instances of HLAthena, run the following:
    * `docker stop $(docker ps -q --filter ancestor=ssarkizova/hlathena-external)`
    * This will complain if no containers are running
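
One way to avoid the complaint is to list the containers first and only call `docker stop` when something is running. This is a sketch under the assumption that the `docker` CLI is on the `PATH`. The function names are hypothetical; the image name and the `docker ps` filter are the ones given above.

```python
import subprocess

IMAGE = "ssarkizova/hlathena-external"


def running_containers(image=IMAGE):
    """Return the IDs of running containers started from `image`."""
    res = subprocess.run(
        ["docker", "ps", "-q", "--filter", "ancestor={}".format(image)],
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True)
    return res.stdout.split()


def stop_command(container_ids):
    """Build the `docker stop` argv, or None when there is nothing to stop."""
    if not container_ids:
        return None
    return ["docker", "stop"] + list(container_ids)


def stop_all(image=IMAGE):
    """Stop every running container of `image`; do nothing if there are none."""
    cmd = stop_command(running_containers(image))
    if cmd is not None:
        subprocess.run(cmd, check=True)
```

Splitting out `stop_command` keeps the decision logic testable without a Docker daemon.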
  
------------------------------------------------------------------------

In [1]:
import os

datadir = os.path.abspath("../data")
outdir = os.path.join(datadir, "out")
prddir = os.path.join(datadir, "prd")

verbose = False

ranks = True

# HLAthena does not appear to like tmpdir being "/tmp"
tmpdir = os.path.abspath(os.path.join(datadir, "tmp"))
tmpfile = os.path.abspath(os.path.join(tmpdir, "tmp.csv"))

thirdparty = os.path.abspath("../third_party")
pseudofile = os.path.join(thirdparty, "netmhcpan/data/MHC_pseudo.dat")

# strip the trailing newline printed by `id`
gid = os.popen("id -g $USER").read().strip()

os.makedirs(tmpdir, exist_ok=True)

In [5]:
import sys
import subprocess
import time

import numpy as np
import pandas as pd


def uid(df):
    # index rows by a unique "<mhc>_<pep>" identifier
    df["uid"] = df["mhc"] + '_' + df["pep"]
    return df.set_index("uid", drop=True)


def subprocrun(cmd, cwd=None, pipe=(not verbose)):
    # run a whitespace-split command; stdout is captured (and returned) only when pipe is True
    if pipe:
        res = subprocess.run(
            cmd.split(),
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,
            cwd=cwd)
    else:
        res = subprocess.run(
            cmd.split(),
            universal_newlines=True,
            cwd=cwd)
    return res.stdout


def runmodel(func, data, name):
    # call func once per MHC allele and stitch the per-allele scores back together
    out = list()
    for mhc, grp in data.groupby("mhc"):
        out.append(pd.Series(
            data=func(mhc, grp["pep"]),
            index=grp.index,
            name=name))
        try:
            os.remove(tmpfile)
        except OSError:
            pass
    return pd.concat(out)


def mhcnuggets(data, outfile):

    def _run(mhc, pep):
        pep.to_csv(
            tmpfile,
            header=False,
            index=False)
        cmd = "python predict.py" + \
            " --class=I" + \
            " --peptides={}".format(tmpfile) + \
            " --allele={}".format(mhc.replace('*','')) + \
            " --output={}".format(outfile)
        subprocrun(
            cmd=cmd,
            cwd=os.path.join(thirdparty, "mhcnuggets/mhcnuggets/src"))
        ic50 = pd.read_csv(
            outfile,
            usecols=["ic50"]).iloc[:,0].tolist()
        # log-scale the IC50 so that 1 maps to 1 and 50000 maps to 0
        return 1 - (np.log(ic50) / np.log(50000))

    return runmodel(
        func=_run,
        data=data,
        name="MHCnuggets-2.4.0")


def netmhcpan(data, outfile):

    def _run(mhc, pep):
        pep.to_csv(
            tmpfile,
            header=False,
            index=False)
        cmd = "./netMHCpan" + \
            " -p {}".format(tmpfile) + \
            " -a {}".format(mhc.replace('*','')) + \
            " -xls" + \
            " -xlsfile {}".format(outfile)
        subprocrun(
            cmd=cmd,
            cwd=os.path.join(thirdparty, "netmhcpan"))
        return pd.read_csv(
            outfile,
            delimiter='\t',
            skiprows=1,
            usecols=["EL_Rank" if ranks else "EL-score"]).iloc[:,0].tolist()

    return runmodel(
        func=_run,
        data=data,
        name="NetMHCpan-4.1")


def hlathena(data, outfile):
    def _run(mhc, pep):
        pep.to_csv(
            tmpfile,
            index=False)
        cmd = "docker run --user 0:{} ".format(gid) + \
            " -v {}:{}".format(tmpdir, tmpdir) + \
            " -w {}".format(tmpdir) + \
            " ssarkizova/hlathena-external predict" + \
            " --runID hlathena" + \
            " --rundir {}".format(tmpdir) + \
            " -p {}".format(tmpfile) + \
            " -a {}".format(mhc[4:].replace('*','').replace(':',''))
        subprocrun(
            cmd=cmd,
            cwd=tmpdir)
        prd = pd.read_csv(
            os.path.join(tmpdir, "hlathena-predictions.txt"),
            delimiter='\t',
            usecols=[0,4 if ranks else 3])
        return prd.set_index("pep").loc[pep].iloc[:,0].tolist()

    return runmodel(
        func=_run,
        data=data,
        name="HLAthena")


def _gfeller(method, data, outfile):

    def _run(mhc, pep):
        pep.to_csv(
            tmpfile,
            header=False,
            index=False)
        if method=="PRIME-1.0":
            cmd = "./PRIME -mix ../MixMHCpred-2.1/MixMHCpred"
            cwd = os.path.join(thirdparty, "PRIME-1.0")
        elif method=="PRIME-2.0":
            cmd = "./PRIME -mix ../MixMHCpred-2.2/MixMHCpred"
            cwd = os.path.join(thirdparty, "PRIME-2.0")
        elif method=="MixMHCpred-2.1":
            cmd = "./MixMHCpred"
            cwd = os.path.join(thirdparty, "MixMHCpred-2.1")
        elif method=="MixMHCpred-2.2":
            cmd = "./MixMHCpred"
            cwd = os.path.join(thirdparty, "MixMHCpred-2.2")
        else:
            raise ValueError(
                "Unexpected gfeller method: {}".format(method))
        cmd += \
            " -i {}".format(tmpfile) + \
            " -a {}".format(mhc) + \
            " -o {}".format(outfile)
        subprocrun(
            cmd=cmd,
            cwd=cwd)
        try:
            return pd.read_csv(
                outfile,
                delimiter='\t',
                skiprows=11,
                usecols=["%Rank_bestAllele" if ranks else "Score_bestAllele"]).iloc[:,0].tolist()
        except Exception as e:
            print(e)
            return [float("nan") for _ in range(len(pep))]

    return runmodel(
        func=_run,
        data=data,
        name=method)


def prime1(data, outfile):
    return _gfeller("PRIME-1.0", data, outfile)


def prime2(data, outfile):
    return _gfeller("PRIME-2.0", data, outfile)


def mixmhcpred21(data, outfile):
    return _gfeller("MixMHCpred-2.1", data, outfile)


def mixmhcpred22(data, outfile):
    return _gfeller("MixMHCpred-2.2", data, outfile)


def mhcflurry(data, outfile):
    data.to_csv(
        tmpfile,
        columns=["mhc","pep"],
        header=["allele","peptide"],
        index=False)
    cmd = "mhcflurry-predict {} --out={}".format(
        tmpfile, outfile)
    subprocrun(cmd=cmd)

    out = pd.read_csv(
        outfile,
        usecols=["mhcflurry_presentation_percentile" if ranks else "mhcflurry_presentation_score"]
    ).iloc[:,0].tolist()

    name = "MHCflurry-2.0"

    out = pd.DataFrame({
        "mhc":data["mhc"],
        "pep":data["pep"],
        name:out})

    return uid(out)[name]


def transphla(data, outfile):

    chunksize = 100*1000

    # map each allele name (first field) to its pseudo-sequence (last field)
    mhcmap = dict()
    with open(pseudofile, 'r') as f:
        for line in f:
            line = line.strip()
            mhc = line[:line.find(' ')]
            seq = line[line.rfind(' ')+1:]
            mhcmap[mhc] = seq

    mhclines = [
        ">{}\n{}".format(
            mhc,
            mhcmap[mhc.replace('*','')])
        for mhc in data["mhc"]]

    peplines = [
        ">{}\n{}".format(
            pep,
            pep)
        for pep in data["pep"]]

    col = "y_prob"
    idx1 = 0
    preds = list()
    while idx1 < len(data):
        idx2 = min(idx1+chunksize, len(data))
        with open(outfile, 'w') as f:
            f.write('\n'.join(mhclines[idx1:idx2]))
        with open(tmpfile, 'w') as f:
            f.write('\n'.join(peplines[idx1:idx2]))
        cmd = "python pHLAIformer.py" + \
            " --peptide_file={}".format(tmpfile) + \
            " --HLA_file={}".format(outfile) + \
            " --cut_length=15" + \
            " --output_dir=/tmp" + \
            " --output_attention=0" + \
            " --output_heatmap=0" + \
            " --output_mutation=0"
        subprocrun(
            cmd=cmd,
            cwd=os.path.join(thirdparty, "TransPHLA-AOMP/TransPHLA-AOMP"))
        preds.append(
            pd.read_csv(
                "/tmp/predict_results.csv",
                usecols=[col]))
        idx1 = idx2

    out = pd.concat(preds)[col].tolist()

    name = "TransPHLA"

    out = pd.DataFrame({
        "mhc":data["mhc"],
        "pep":data["pep"],
        name:out})

    return uid(out)[name]


def run(models, filename):
    print("running {}...".format(filename[:filename.index('.')]))
    prdfile = os.path.join(prddir, filename)
    df = uid(pd.read_csv(os.path.join(outdir, filename)))
    for m in models:
        start = time.time()
        out = m(df, prdfile)
        print(m.__name__, time.time() - start)
        df = pd.concat((df, out), axis=1)
    df.to_csv(prdfile, index=False)
    return df
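
The `mhcnuggets` wrapper above turns predicted IC50 values into scores with `1 - log(ic50) / log(50000)`. The standalone sketch below works through a few values. The function name `ic50_to_score` and the sample inputs are illustrative only.

```python
import math

IC50_CAP = 50000.0  # same constant as in the mhcnuggets wrapper


def ic50_to_score(ic50):
    """Log-scale an IC50 so that 1 maps to 1.0 and IC50_CAP maps to 0.0."""
    return 1.0 - math.log(ic50) / math.log(IC50_CAP)


for ic50 in (1.0, 500.0, 50000.0):
    print(ic50, round(ic50_to_score(ic50), 4))
```

Lower IC50 values give higher scores, and an IC50 above 50000 would give a negative score, since the transform itself does not clip.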

In [96]:
df = run([netmhcpan, mhcnuggets, mixmhcpred21, mixmhcpred22, transphla],
    "el_test.csv")

df = run([netmhcpan, mhcflurry, mhcnuggets, mixmhcpred21, mixmhcpred22, prime1, prime2, transphla, hlathena],
    "asdf.csv")

df = run([netmhcpan, mhcflurry, mhcnuggets, mixmhcpred21, mixmhcpred22, prime1, prime2, transphla, hlathena],
    "iedb.csv")