# Programmatic usage of NeoFox

NeoFox provides an Application Programming Interface (API) that enables its integration into other applications. This API relies heavily on Protocol Buffers data models, which provide placeholder objects to store the required data while enabling different representations, data manipulation, validation and normalization. We use the Protocol Buffers data models to generate Python code automatically and to implement validation and normalization around them. Because Protocol Buffers is technology agnostic, this may also facilitate integration with third-party applications not necessarily implemented in Python (see https://developers.google.com/protocol-buffers). The API is tightly integrated with the Python data analysis library Pandas (see https://pandas.pydata.org/).

Here we show: 

* how to create new model objects
* how to import/export these objects into different representations
* how to manipulate them
* how to validate and normalize the data

Finally, we show how to run NeoFox programmatically; you may want to skip to that part for a quick grasp of the API usage.

## Neoantigens

The neoantigen is the central piece of information that NeoFox handles; all output annotations refer to a neoantigen. A neoantigen is formed by two subentities, a transcript and a mutation, plus some additional attributes. Here we show how to create a neoantigen, transform it into different representations and validate it.

### Create a neoantigen

Create a transcript:

In [1]:
from neofox.model.neoantigen import Transcript
transcript = Transcript(
    assembly="hg19", 
    gene="VCAN", 
    identifier="uc003kii.3")

Create a mutation:

In [2]:
from neofox.model.neoantigen import Mutation
mutation = Mutation(
    position=1007, 
    wild_type_aminoacid="I", 
    mutated_aminoacid="T", 
    left_flanking_region="DEVLGEPSQDILV", 
    right_flanking_region="DQTRLEATISPET")

Create a neoantigen using the previous transcript and mutation:

In [3]:
from neofox.model.neoantigen import Neoantigen
neoantigen = Neoantigen(
    transcript=transcript, 
    mutation=mutation, 
    patient_identifier="P123", 
    rna_expression=0.519506894, 
    rna_variant_allele_frequency=0.857142857, 
    dna_variant_allele_frequency=0.294573643)

### Representation into different formats

The same piece of data conforming to the NeoFox data models can be represented in different formats. Here we show how to transform the data between several formats: JSON, Python dictionaries, Protocol Buffers binary representations, Pandas dataframes and tabular representations in files. This enables data import and export and adds flexibility to the integration with other tools.

What is shown here is applicable to all entities in NeoFox data models.

These objects can be easily transformed into JSON:

In [4]:
print(neoantigen.to_json(indent=2))

{
  "patientIdentifier": "P123",
  "transcript": {
    "identifier": "uc003kii.3",
    "assembly": "hg19",
    "gene": "VCAN"
  },
  "mutation": {
    "position": 1007,
    "wildTypeAminoacid": "I",
    "mutatedAminoacid": "T",
    "leftFlankingRegion": "DEVLGEPSQDILV",
    "rightFlankingRegion": "DQTRLEATISPET"
  },
  "rnaExpression": 0.519506894,
  "dnaVariantAlleleFrequency": 0.294573643,
  "rnaVariantAlleleFrequency": 0.857142857
}
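
The JSON keys above are the lowerCamelCase form of the snake_case attribute names used in Python (for example `rna_expression` becomes `rnaExpression`). The following is a minimal sketch of that naming convention only; `to_camel_case` and `camelize_keys` are hypothetical helpers written for illustration and are not part of NeoFox.

```python
def to_camel_case(snake_name: str) -> str:
    """Convert a snake_case name into lowerCamelCase (hypothetical helper)."""
    head, *tail = snake_name.split("_")
    return head + "".join(part.capitalize() for part in tail)


def camelize_keys(record: dict) -> dict:
    """Recursively rename the keys of a nested dictionary."""
    return {
        to_camel_case(key): camelize_keys(value) if isinstance(value, dict) else value
        for key, value in record.items()
    }


# Illustrative record using attribute names that appear in this document
example = {
    "patient_identifier": "P123",
    "mutation": {"wild_type_aminoacid": "I", "mutated_aminoacid": "T"},
}
print(camelize_keys(example))
```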


The objects can also be transformed into a Python native dictionary:

In [5]:
transcript.to_dict()

{'identifier': 'uc003kii.3', 'assembly': 'hg19', 'gene': 'VCAN'}

The objects can also be serialized into the Protocol Buffers binary format, which is more compact for storing the data or sending it over the wire:

In [6]:
mutation.SerializeToString()

b'\x08\xef\x07\x1a\x01I*\x01T2\rDEVLGEPSQDILVB\rDQTRLEATISPET'
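
To see why the binary form is compact, the bytes above can be decoded by hand with the Protocol Buffers wire format: each field is a tag (field number and wire type) followed by a varint or a length-prefixed payload, and field names are not stored. The sketch below is a simplified decoder that handles only the two wire types present in this message. It is not how NeoFox reads messages, and the function names are hypothetical.

```python
from typing import List, Tuple


def read_varint(data: bytes, offset: int) -> Tuple[int, int]:
    """Read a base-128 varint starting at offset; return (value, next offset)."""
    value, shift = 0, 0
    while True:
        byte = data[offset]
        offset += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:  # a cleared high bit marks the last byte
            return value, offset
        shift += 7


def decode_fields(data: bytes) -> List[tuple]:
    """Decode varint (wire type 0) and length-delimited (wire type 2) fields."""
    fields, offset = [], 0
    while offset < len(data):
        tag, offset = read_varint(data, offset)
        field_number, wire_type = tag >> 3, tag & 0x07
        if wire_type == 0:
            value, offset = read_varint(data, offset)
        elif wire_type == 2:
            length, offset = read_varint(data, offset)
            value = data[offset:offset + length].decode("utf-8")
            offset += length
        else:
            raise ValueError("Wire type {} is not handled in this sketch".format(wire_type))
        fields.append((field_number, value))
    return fields


serialized = b'\x08\xef\x07\x1a\x01I*\x01T2\rDEVLGEPSQDILVB\rDQTRLEATISPET'
for field_number, value in decode_fields(serialized):
    print(field_number, value)
```

The first decoded field is the varint 1007, the mutation position, stored in two bytes. The remaining fields are the two amino acids and the two flanking regions, each preceded by a one-byte tag and a one-byte length.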

### Integration with Pandas

NeoFox integrates with Pandas, the Python library for data analysis (see https://pandas.pydata.org/). A single object can be transformed into a Pandas `Series` and a list of objects can be transformed into a Pandas `DataFrame`. Pandas provides functionality to persist these tabular representations to files that can be stored and imported into other environments, for instance R.

What is shown here is applicable to all entities in NeoFox data models.

Transform a neoantigen into a Pandas `Series`:

In [7]:
from neofox.model.conversion import ModelConverter
ModelConverter.object2series(neoantigen)

clonality_estimation                           False
dna_variant_allele_frequency                0.294574
identifier                                          
mutation.left_flanking_region          DEVLGEPSQDILV
mutation.mutated_aminoacid                         T
mutation.mutated_xmer                               
mutation.position                               1007
mutation.right_flanking_region         DQTRLEATISPET
mutation.size_left_flanking_region                 0
mutation.size_right_flanking_region                0
mutation.wild_type_aminoacid                       I
mutation.wild_type_xmer                             
patient_identifier                              P123
rna_expression                              0.519507
rna_variant_allele_frequency                0.857143
transcript.assembly                             hg19
transcript.gene                                 VCAN
transcript.identifier                     uc003kii.3
Name: 0, dtype: object

Transform a list of mutations into a Pandas `DataFrame`:

In [8]:
mutation2 = Mutation(
    position=126, 
    wild_type_aminoacid="A", 
    mutated_aminoacid="G", 
    left_flanking_region="AAAAAAAAAAAAA", 
    right_flanking_region="AAAAAAAAAAAAA")
mutations_df = ModelConverter.objects2dataframe([mutation, mutation2])
mutations_df

Unnamed: 0,leftFlankingRegion,mutatedAminoacid,mutatedXmer,position,rightFlankingRegion,sizeLeftFlankingRegion,sizeRightFlankingRegion,wildTypeAminoacid,wildTypeXmer
0,DEVLGEPSQDILV,T,,1007,DQTRLEATISPET,0,0,I,
1,AAAAAAAAAAAAA,G,,126,AAAAAAAAAAAAA,0,0,A,


Persist any Pandas object into a file:

In [9]:
mutations_df.to_csv("my_mutations.csv")

And read it back:

In [10]:
import pandas as pd
mutations_df2 = pd.read_csv("my_mutations.csv")
mutations = []
for _, row in mutations_df2.iterrows():
    # rebuild the nested structure expected by the model from the flat row
    nested_dict = ModelConverter._flat_dict2nested_dict(flat_dict=row.to_dict())
    mutations.append(Mutation().from_dict(nested_dict))
ModelConverter.objects2dataframe(mutations)

Unnamed: 0,leftFlankingRegion,mutatedAminoacid,mutatedXmer,position,rightFlankingRegion,sizeLeftFlankingRegion,sizeRightFlankingRegion,wildTypeAminoacid,wildTypeXmer
0,DEVLGEPSQDILV,T,,1007,DQTRLEATISPET,0,0,I,
1,AAAAAAAAAAAAA,G,,126,AAAAAAAAAAAAA,0,0,A,
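
Tabular formats are flat, whereas a neoantigen nests a transcript and a mutation. The dotted names shown earlier in the `Series` output (for example `mutation.position`) suggest how nested fields map onto columns. The sketch below illustrates that mapping in both directions with plain dictionaries. The helpers `flatten` and `unflatten` are hypothetical and are not the NeoFox implementation.

```python
def flatten(nested: dict, prefix: str = "") -> dict:
    """Turn a nested dictionary into a flat one with dotted keys."""
    flat = {}
    for key, value in nested.items():
        name = "{}.{}".format(prefix, key) if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


def unflatten(flat: dict) -> dict:
    """Rebuild the nested dictionary from dotted keys."""
    nested = {}
    for name, value in flat.items():
        *parents, leaf = name.split(".")
        node = nested
        for parent in parents:
            node = node.setdefault(parent, {})
        node[leaf] = value
    return nested


record = {
    "patient_identifier": "P123",
    "transcript": {"assembly": "hg19", "gene": "VCAN"},
    "mutation": {"position": 1007},
}
flat_record = flatten(record)
print(flat_record)
print(unflatten(flat_record) == record)
```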


### Data validation

The quality and cleanliness of data are of great importance for effective data analysis and for making the data machine readable. Clean data is valid and in a normal, homogeneous form. The use of controlled vocabularies helps to represent knowledge in a standardised way. This is a domain-specific task: although it can be assisted with the right tools, such as Pandas in Python or tidyverse in R, it requires domain expertise. NeoFox provides this domain expertise out of the box with its validation and normalization layers on top of its data models.

The data validation checks for missing required fields and shows relevant messages.

In [11]:
from neofox.model.conversion import ModelValidator
from neofox.exceptions import NeofoxDataValidationException
try:
    ModelValidator.validate_neoantigen(neoantigen=Neoantigen())
except NeofoxDataValidationException as e:
    print("Error message: {}".format(e))

Error message: Empty gene symbol


It also performs more domain-specific validations, such as checking that amino acids are valid according to the IUPAC standard amino acid representation.

In [12]:
try:
    ModelValidator.validate_neoantigen(neoantigen=Neoantigen(
        transcript=Transcript(
            assembly="hg19", 
            gene="VCAN", 
            identifier="uc003kii.3"
        ),
        mutation=Mutation(
            position=126, 
            wild_type_aminoacid="A", 
            mutated_aminoacid="G", 
            left_flanking_region="123456", # wrong aminoacid representation
            right_flanking_region="AAAAAAAAAAAAA")))
except NeofoxDataValidationException as e:
    print("Error message: {}".format(e))

Error message: Non existing aminoacid 1
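
The check above can be pictured as a membership test against the amino acid alphabet. This is a minimal sketch, assuming that only the 20 standard one-letter IUPAC codes are accepted; the actual alphabet used by NeoFox may differ. `validate_sequence` is a hypothetical name, and a plain `ValueError` stands in for `NeofoxDataValidationException`.

```python
# Assumption: only the 20 standard residues are considered valid
IUPAC_ONE_LETTER = set("ACDEFGHIKLMNPQRSTVWY")


def validate_sequence(sequence: str) -> None:
    """Raise ValueError on the first character that is not a known residue."""
    for residue in sequence:
        if residue.upper() not in IUPAC_ONE_LETTER:
            raise ValueError("Non existing aminoacid {}".format(residue))


for candidate in ["DEVLGEPSQDILV", "123456"]:
    try:
        validate_sequence(candidate)
        print("{}: valid".format(candidate))
    except ValueError as error:
        print("{}: {}".format(candidate, error))
```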


The data normalization layer ensures the amino acid representation is normalized into one-letter IUPAC codes, although three-letter IUPAC codes can also be provided.

In [13]:
valid_neoantigen = ModelValidator.validate_neoantigen(neoantigen=Neoantigen(
    transcript=Transcript(
        assembly="hg19", 
        gene="VCAN", 
        identifier="uc003kii.3"
    ),
    mutation=Mutation(
        position=126, 
        wild_type_aminoacid="Ala", # 3 letter IUPAC code
        mutated_aminoacid="Gly", 
        left_flanking_region="AAAAAAAAAAAAA",
        right_flanking_region="AAAAAAAAAAAAA")))

print(valid_neoantigen.mutation.to_json(indent=2))

{
  "position": 126,
  "wildTypeXmer": "AAAAAAAAAAAAAAAAAAAAAAAAAAA",
  "wildTypeAminoacid": "A",
  "mutatedXmer": "AAAAAAAAAAAAAGAAAAAAAAAAAAA",
  "mutatedAminoacid": "G",
  "leftFlankingRegion": "AAAAAAAAAAAAA",
  "sizeLeftFlankingRegion": 13,
  "rightFlankingRegion": "AAAAAAAAAAAAA",
  "sizeRightFlankingRegion": 13
}
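
The conversion from three-letter to one-letter codes amounts to a table lookup. The sketch below uses the standard IUPAC table for the 20 common amino acids. The function name `normalize_aminoacid` and its case handling are assumptions made for illustration, not NeoFox behaviour.

```python
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}


def normalize_aminoacid(code: str) -> str:
    """Return the one-letter code for a one-letter or three-letter input."""
    code = code.strip()
    if len(code) == 3 and code.capitalize() in THREE_TO_ONE:
        return THREE_TO_ONE[code.capitalize()]
    if len(code) == 1 and code.upper() in THREE_TO_ONE.values():
        return code.upper()
    raise ValueError("Non existing aminoacid {}".format(code))


print(normalize_aminoacid("Ala"), normalize_aminoacid("Gly"), normalize_aminoacid("T"))
```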


The validation of a neoantigen fills in some redundant representations; see below the fields `wildTypeXmer`, `mutatedXmer`, `sizeLeftFlankingRegion` and `sizeRightFlankingRegion`, which were never provided. After validation a unique neoantigen identifier is generated. This identifier is a hash of the normalized neoantigen representation, so two different representations of the same neoantigen share the same identifier after normalization.

In [14]:
validated_neoantigen = ModelValidator.validate_neoantigen(neoantigen=neoantigen)
print(validated_neoantigen.to_json(indent=2))

{
  "identifier": "jETwpX0R9iEiQz2SMpHkPQ==",
  "patientIdentifier": "P123",
  "transcript": {
    "identifier": "uc003kii.3",
    "assembly": "hg19",
    "gene": "VCAN"
  },
  "mutation": {
    "position": 1007,
    "wildTypeXmer": "DEVLGEPSQDILVIDQTRLEATISPET",
    "wildTypeAminoacid": "I",
    "mutatedXmer": "DEVLGEPSQDILVTDQTRLEATISPET",
    "mutatedAminoacid": "T",
    "leftFlankingRegion": "DEVLGEPSQDILV",
    "sizeLeftFlankingRegion": 13,
    "rightFlankingRegion": "DQTRLEATISPET",
    "sizeRightFlankingRegion": 13
  },
  "rnaExpression": 0.519506894,
  "dnaVariantAlleleFrequency": 0.294573643,
  "rnaVariantAlleleFrequency": 0.857142857
}
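
The derived fields in this output follow directly from the provided ones: each xmer is the left flanking region, the amino acid at the mutated position and the right flanking region joined together. The sketch below reproduces that derivation and illustrates the idea of an identifier computed from a normalized representation. The helper names are hypothetical, and the hashing recipe (SHA-256 over sorted `key=value` pairs, Base64-encoded) is an assumption chosen for illustration, so it will not reproduce the identifiers generated by NeoFox.

```python
import base64
import hashlib


def build_xmers(left: str, wild_type: str, mutated: str, right: str) -> dict:
    """Derive the redundant fields from the flanking regions and amino acids."""
    return {
        "wild_type_xmer": left + wild_type + right,
        "mutated_xmer": left + mutated + right,
        "size_left_flanking_region": len(left),
        "size_right_flanking_region": len(right),
    }


def sketch_identifier(normalized_fields: dict) -> str:
    """Hash a canonical rendering of the fields (assumed recipe, not NeoFox's)."""
    canonical = "|".join(
        "{}={}".format(key, normalized_fields[key]) for key in sorted(normalized_fields)
    )
    digest = hashlib.sha256(canonical.encode("utf-8")).digest()
    return base64.b64encode(digest).decode("ascii")


derived = build_xmers("DEVLGEPSQDILV", "I", "T", "DQTRLEATISPET")
print(derived)
# The same content in a different key order yields the same identifier
print(sketch_identifier({"gene": "VCAN", "position": 1007})
      == sketch_identifier({"position": 1007, "gene": "VCAN"}))
```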


The data normalization overrides incoherent redundant data.

In [15]:
valid_neoantigen = ModelValidator.validate_neoantigen(neoantigen=Neoantigen(
    transcript=Transcript(
        assembly="hg19", 
        gene="VCAN", 
        identifier="uc003kii.3"
    ),
    mutation=Mutation(
        position=126, 
        wild_type_aminoacid="Ala",
        mutated_aminoacid="Gly", 
        left_flanking_region="AAAAAAAAAAAAA",
        right_flanking_region="AAAAAAAAAAAAA",
        mutated_xmer="GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG"  # incoherent mutated xmer
    )))

print(valid_neoantigen.mutation.to_json(indent=2))

{
  "position": 126,
  "wildTypeXmer": "AAAAAAAAAAAAAAAAAAAAAAAAAAA",
  "wildTypeAminoacid": "A",
  "mutatedXmer": "AAAAAAAAAAAAAGAAAAAAAAAAAAA",
  "mutatedAminoacid": "G",
  "leftFlankingRegion": "AAAAAAAAAAAAA",
  "sizeLeftFlankingRegion": 13,
  "rightFlankingRegion": "AAAAAAAAAAAAA",
  "sizeRightFlankingRegion": 13
}


## Patients

The neoantigen annotation process needs some context information, in particular some data about the individual in whom the somatic mutation creating this neoantigen took place. This information mainly includes the HLA types of the patient, which are key to computing the binding of the potential neoepitopes.

### Parse MHC I alleles into a normal representation

The main complexity in the patient model is the representation of the MHC I and MHC II alleles present in the patient. The HLA alleles are typically represented using the nomenclature defined at http://hla.alleles.org, but de facto there is some flexibility in how the community represents HLA alleles. NeoFox aims at normalizing the different HLA representations into a controlled representation that agrees with the HLA nomenclature. NeoFox only supports the classic MHC genes and, although the provided HLA type is kept internally, it only works with the first 4 digits.

There are specific functions in NeoFox to parse a list of non-normalized HLA alleles into a normalized representation of the HLA alleles, including information on the zygosity.

Parse a list of MHC I alleles. The data validation will ensure that the data is valid and will infer the zygosity of the different genes. The data normalization layer will normalize the HLA representation into valid HLA nomenclature including the first 4 digits.

In [16]:
mhc1 = ModelConverter.parse_mhc1_alleles(["HLA-A*01:01:02:03N", "HLA-A*01:02:02:03N", 
                                          "HLA-B*01:01:02:03N", "HLA-B*01:01:02:04N", 
                                          "HLA-C*01:01"])
ModelConverter.objects2dataframe(mhc1)

Unnamed: 0,alleles,name,zygosity
0,"[{'fullName': 'HLA-A*01:01:02:03N', 'name': 'H...",A,HETEROZYGOUS
1,"[{'fullName': 'HLA-B*01:01:02:03N', 'name': 'H...",B,HOMOZYGOUS
2,"[{'fullName': 'HLA-C*01:01', 'name': 'HLA-C*01...",C,HEMIZYGOUS


In [17]:
ModelConverter.objects2dataframe(mhc1[0].alleles + mhc1[1].alleles + mhc1[2].alleles)

Unnamed: 0,fullName,gene,group,name,protein
0,HLA-A*01:01:02:03N,A,1,HLA-A*01:01,1
1,HLA-A*01:02:02:03N,A,1,HLA-A*01:02,2
2,HLA-B*01:01:02:03N,B,1,HLA-B*01:01,1
3,HLA-C*01:01,C,1,HLA-C*01:01,1


### Validation and normalization of MHC alleles

The data validation layer checks that the provided allele representations are valid.

In [18]:
try:
    ModelConverter.parse_mhc1_alleles(["HLA-W*01:01:02:03N"])  # bad gene W
except NeofoxDataValidationException as e:
    print("Error message: {}".format(e))

Error message: Gene from MHC I allele is not valid: W


In [19]:
try:
    ModelConverter.parse_mhc1_alleles(["HLA-A*first:second:02:03N"])  # bad allele representation
except NeofoxDataValidationException as e:
    print("Error message: {}".format(e))

Error message: Allele does not match HLA allele pattern HLA-A*first:second:02:03N


In [20]:
try:
    ModelConverter.parse_mhc1_alleles(["HLA-A*01:02:02:03N", "HLA-A*01:03:02:03N", "HLA-A*01:04:02:03N"])  # wrong number of alleles
except NeofoxDataValidationException as e:
    print("Error message: {}".format(e))

Error message: More than 2 alleles for gene A


The data normalization layer ensures that different representations of the same HLA allele are equivalent in the internal representation.

In [21]:
ModelConverter.parse_mhc1_alleles(["HLA-A*01:02:02:03N", "HLA-A*01:02", "HLA-B*01:01", "HLA-B0101"])

[Mhc1(name=<Mhc1Name.A: 0>, zygosity=<Zygosity.HOMOZYGOUS: 0>, alleles=[MhcAllele(full_name='HLA-A*01:02:02:03N', name='HLA-A*01:02', gene='A', group='01', protein='02')]),
 Mhc1(name=<Mhc1Name.B: 1>, zygosity=<Zygosity.HOMOZYGOUS: 0>, alleles=[MhcAllele(full_name='HLA-B*01:01', name='HLA-B*01:01', gene='B', group='01', protein='01')]),
 Mhc1(name=<Mhc1Name.C: 2>, zygosity=<Zygosity.LOSS: 3>, alleles=[])]
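
One way to picture this normalization is a regular expression that tolerates the optional separators and discards everything after the first 4 digits. The pattern below is an assumption written to cover the examples in this section: two-digit group and protein fields, and MHC I genes only. It is not the pattern used by NeoFox, and a plain `ValueError` stands in for `NeofoxDataValidationException`.

```python
import re

# Assumed pattern: optional "HLA-" prefix, optional "*" and ":" separators,
# two-digit group and protein, then any further fields and an optional suffix letter.
ALLELE_PATTERN = re.compile(
    r"^(?:HLA-)?(?P<gene>[A-Z]+)\*?(?P<group>\d{2}):?(?P<protein>\d{2})(?::\d+)*[A-Z]?$"
)
VALID_MHC1_GENES = {"A", "B", "C"}


def normalize_mhc1_allele(allele: str) -> str:
    """Return the 4-digit name, e.g. HLA-B0101 -> HLA-B*01:01."""
    match = ALLELE_PATTERN.match(allele.strip())
    if match is None:
        raise ValueError("Allele does not match HLA allele pattern {}".format(allele))
    gene = match.group("gene")
    if gene not in VALID_MHC1_GENES:
        raise ValueError("Gene from MHC I allele is not valid: {}".format(gene))
    return "HLA-{}*{}:{}".format(gene, match.group("group"), match.group("protein"))


for raw in ["HLA-A*01:02:02:03N", "HLA-A*01:02", "HLA-B*01:01", "HLA-B0101"]:
    print(raw, "->", normalize_mhc1_allele(raw))
```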

### Parse MHC II alleles into a normal representation

The model for MHC II alleles is more complex as we need to reflect all combinations of alpha and beta chains, but the data validation and normalization provided by NeoFox is fundamentally the same.

Parse a list of MHC II alleles:

In [22]:
mhc2 = ModelConverter.parse_mhc2_alleles(["HLA-DPA1*01:01", "HLA-DPA1*01:02", "HLA-DPB1*01:01", "HLA-DPB1*01:01", 
                                          "HLA-DQA1*01:01", "HLA-DQA1*01:01", "HLA-DQB1*01:01", "HLA-DQB1*01:01", 
                                          "HLA-DRB1*01:01", "HLA-DRB1*01:01"])

An MHC II gene with a heterozygous alpha chain and a homozygous beta chain has two isoforms.

In [23]:
print(mhc2[1].to_json(indent=2))

{
  "name": "DP",
  "genes": [
    {
      "name": "DPA1",
      "zygosity": "HETEROZYGOUS",
      "alleles": [
        {
          "fullName": "HLA-DPA1*01:01",
          "name": "HLA-DPA1*01:01",
          "gene": "DPA1",
          "group": "01",
          "protein": "01"
        },
        {
          "fullName": "HLA-DPA1*01:02",
          "name": "HLA-DPA1*01:02",
          "gene": "DPA1",
          "group": "01",
          "protein": "02"
        }
      ]
    },
    {
      "name": "DPB1",
      "alleles": [
        {
          "fullName": "HLA-DPB1*01:01",
          "name": "HLA-DPB1*01:01",
          "gene": "DPB1",
          "group": "01",
          "protein": "01"
        }
      ]
    }
  ],
  "isoforms": [
    {
      "name": "HLA-DPA1*01:01-DPB1*01:01",
      "alphaChain": {
        "fullName": "HLA-DPA1*01:01",
        "name": "HLA-DPA1*01:01",
        "gene": "DPA1",
        "group": "01",
        "protein": "01"
      },
      "betaChain": {
        "fullName": "HLA-DP

An MHC II gene with homozygous alpha and beta chains has a single isoform.

In [24]:
print(mhc2[2].to_json(indent=2))

{
  "name": "DQ",
  "genes": [
    {
      "name": "DQA1",
      "alleles": [
        {
          "fullName": "HLA-DQA1*01:01",
          "name": "HLA-DQA1*01:01",
          "gene": "DQA1",
          "group": "01",
          "protein": "01"
        }
      ]
    },
    {
      "name": "DQB1",
      "alleles": [
        {
          "fullName": "HLA-DQB1*01:01",
          "name": "HLA-DQB1*01:01",
          "gene": "DQB1",
          "group": "01",
          "protein": "01"
        }
      ]
    }
  ],
  "isoforms": [
    {
      "name": "HLA-DQA1*01:01-DQB1*01:01",
      "alphaChain": {
        "fullName": "HLA-DQA1*01:01",
        "name": "HLA-DQA1*01:01",
        "gene": "DQA1",
        "group": "01",
        "protein": "01"
      },
      "betaChain": {
        "fullName": "HLA-DQB1*01:01",
        "name": "HLA-DQB1*01:01",
        "gene": "DQB1",
        "group": "01",
        "protein": "01"
      }
    }
  ]
}


The MHC II DRB gene is a special case with no alpha chain represented, as the alpha chain is not variable.

In [25]:
print(mhc2[0].to_json(indent=2))

{
  "genes": [
    {
      "alleles": [
        {
          "fullName": "HLA-DRB1*01:01",
          "name": "HLA-DRB1*01:01",
          "gene": "DRB1",
          "group": "01",
          "protein": "01"
        }
      ]
    }
  ],
  "isoforms": [
    {
      "name": "HLA-DRB1*01:01",
      "betaChain": {
        "fullName": "HLA-DRB1*01:01",
        "name": "HLA-DRB1*01:01",
        "gene": "DRB1",
        "group": "01",
        "protein": "01"
      }
    }
  ]
}


Beware that incomplete MHC II molecules missing one of the chains will have no isoforms and thus no binding will be computed on them. In the case below the beta chain allele for the DP gene is missing.

In [26]:
mhc2 = ModelConverter.parse_mhc2_alleles(["HLA-DPA1*01:01", "HLA-DPA1*01:02"])
print(mhc2[1].to_json(indent=2))

{
  "name": "DP",
  "genes": [
    {
      "name": "DPA1",
      "zygosity": "HETEROZYGOUS",
      "alleles": [
        {
          "fullName": "HLA-DPA1*01:01",
          "name": "HLA-DPA1*01:01",
          "gene": "DPA1",
          "group": "01",
          "protein": "01"
        },
        {
          "fullName": "HLA-DPA1*01:02",
          "name": "HLA-DPA1*01:02",
          "gene": "DPA1",
          "group": "01",
          "protein": "02"
        }
      ]
    },
    {
      "name": "DPB1",
      "zygosity": "LOSS"
    }
  ]
}
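
The isoform counts in the examples above match the combinations of distinct alpha and beta chain alleles. The sketch below builds isoform names that way, following the naming visible in the outputs (`HLA-DPA1*01:01-DPB1*01:01`). `build_isoforms` and its `has_alpha_chain` flag are hypothetical and only illustrate the combinatorics, not the NeoFox implementation.

```python
from itertools import product


def build_isoforms(alpha_alleles, beta_alleles, has_alpha_chain=True):
    """Combine the distinct alpha and beta chain alleles into isoform names."""
    betas = sorted(set(beta_alleles))
    if not has_alpha_chain:
        # DR case: only the beta chain is represented
        return betas
    alphas = sorted(set(alpha_alleles))
    isoforms = []
    for alpha, beta in product(alphas, betas):
        beta_without_prefix = beta.replace("HLA-", "", 1)
        isoforms.append("{}-{}".format(alpha, beta_without_prefix))
    return isoforms


# Heterozygous alpha chain and homozygous beta chain
print(build_isoforms(["HLA-DPA1*01:01", "HLA-DPA1*01:02"],
                     ["HLA-DPB1*01:01", "HLA-DPB1*01:01"]))
# Missing beta chain: no isoform can be formed
print(build_isoforms(["HLA-DPA1*01:01", "HLA-DPA1*01:02"], []))
```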


### Create a patient

In [27]:
from neofox.model.neoantigen import Patient


mhc1 = ModelConverter.parse_mhc1_alleles(["HLA-A*01:01:02:03N", "HLA-A*01:02:02:03N", 
                                          "HLA-B*01:01:02:03N", "HLA-B*01:01:02:04N", 
                                          "HLA-C*01:01"])
mhc2 = ModelConverter.parse_mhc2_alleles(["HLA-DPA1*01:01", "HLA-DPA1*01:02", "HLA-DPB1*01:01", "HLA-DPB1*01:01", 
                                          "HLA-DQA1*01:01", "HLA-DQA1*01:01", "HLA-DQB1*01:01", "HLA-DQB1*01:01", 
                                          "HLA-DRB1*01:01", "HLA-DRB1*01:01"])
patient = Patient(
    identifier="P123", 
    is_rna_available=True, 
    tumor_type="NSCLC", 
    mhc1=mhc1,
    mhc2=mhc2
)
ModelConverter.object2series(patient)

identifier                                                       P123
is_rna_available                                                 True
mhc1                [{'name': 'A', 'zygosity': 'HETEROZYGOUS', 'al...
mhc2                [{'name': 'DR', 'genes': [{'name': 'DRB1', 'zy...
tumor_type                                                      NSCLC
Name: 0, dtype: object

### Validate a patient

In [28]:
validated_patient = ModelValidator.validate_patient(patient)

## Run NeoFox

### Parse input data from a file

Although we could create the data objects manually as shown above, for convenience it is useful to store the data in tabular format. Here we show how to parse the neoantigens and patients from tabular files. 

The tabular file for neoantigens should look as follows:

In [29]:
pd.read_csv("test_model_file.txt", sep="\t")

Unnamed: 0,transcript.assembly,transcript.gene,transcript.identifier,mutation.leftFlankingRegion,mutation.mutatedAminoacid,mutation.position,mutation.rightFlankingRegion,mutation.wildTypeAminoacid,patientIdentifier
0,hg19,VCAN,uc003kii.3,DEVLGEPSQDILV,T,1007,DQTRLEATISPET,I,Ptx 27
1,hg19,DCST2,uc001fgm.3,RTNLLAALHRSVR,W,564,RAADQGHRSAFLV,R,Ptx 24
2,hg19,NRAS,uc009wgu.3,MTEYKLVVVGA,C,12,GVGKSALTIQLIQ,G,Ptx 28
3,hg19,CEP350,uc001gnt.3,QTDSSSSDMQACS,K,968,DKAKISLGSSIDS,Q,Ptx 63
4,hg19,CPPED1,uc002dca.4,DRAIPLVLVSGNH,Y,129,IGNTPTAETVEEF,D,Ptx 77
5,hg19,CXorf26,uc004ecl.1,YNKAVYISVQDKE,E,167,EKGVNNGGEKRAD,G,Ptx 117
6,hg19,IGSF9B,uc001qgx.4,ASTHLTVIGTSPH,V,512,PGSVRVQVSMTTA,A,Ptx 110
7,hg19,HEATR5A,uc001wrf.4,TRRDEKSHPFTNP,Q,1222,WATRVFAAECVCR,R,Ptx 26
8,hg19,CHRDL2,uc001ovh.3,ARPDMFCLFHGKR,H,40,FPGESWHPYLEPQ,Y,Ptx 77


Parse it into the model objects as follows:

In [30]:
neoantigens, external_annotations = ModelConverter.parse_neoantigens_file("test_model_file.txt")

The tabular file for patients should look as follows:

In [31]:
pd.read_csv("test_patient_file.txt", sep="\t")

Unnamed: 0,identifier,mhcIAlleles,mhcIIAlleles,isRnaAvailable,tumorType
0,Ptx 27,"HLA-A*03:01,HLA-A*29:02,HLA-B*07:02,HLA-B*44:0...","HLA-DRB1*04:02,HLA-DRB1*08:01,HLA-DQA1*03:01,H...",True,HNSC
1,Ptx 24,"HLA-A*03:01,HLA-A*29:02,HLA-B*07:02,HLA-B*44:0...","HLA-DRB1*04:02,HLA-DRB1*08:01,HLA-DQA1*03:01,H...",True,HNSC
2,Ptx 28,"HLA-A*03:01,HLA-A*29:02,HLA-B*07:02,HLA-B*44:0...","HLA-DRB1*04:02,HLA-DRB1*08:01,HLA-DQA1*03:01,H...",True,HNSC
3,Ptx 63,"HLA-A*03:01,HLA-A*29:02,HLA-B*07:02,HLA-B*44:0...","HLA-DRB1*04:02,HLA-DRB1*08:01,HLA-DQA1*03:01,H...",True,HNSC
4,Ptx 77,"HLA-A*03:01,HLA-A*29:02,HLA-B*07:02,HLA-B*44:0...","HLA-DRB1*04:02,HLA-DRB1*08:01,HLA-DQA1*03:01,H...",True,HNSC
5,Ptx 117,"HLA-A*03:01,HLA-A*29:02,HLA-B*07:02,HLA-B*44:0...","HLA-DRB1*04:02,HLA-DRB1*08:01,HLA-DQA1*03:01,H...",True,HNSC
6,Ptx 110,"HLA-A*03:01,HLA-A*29:02,HLA-B*07:02,HLA-B*44:0...","HLA-DRB1*04:02,HLA-DRB1*08:01,HLA-DQA1*03:01,H...",True,HNSC
7,Ptx 26,"HLA-A*03:01,HLA-A*29:02,HLA-B*07:02,HLA-B*44:0...","HLA-DRB1*04:02,HLA-DRB1*08:01,HLA-DQA1*03:01,H...",True,HNSC


Parse the patients into the model objects as follows:

In [32]:
patients = ModelConverter.parse_patients_file("test_patient_file.txt")
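
For readers who want to inspect such a file without NeoFox, the layout can be read with the standard library alone: one tab-separated row per patient, with the alleles as comma-separated lists in a single column. This is a minimal sketch using the column names shown above. The sample row is illustrative, its allele lists contain only entries that are fully visible in the table above, and `read_patients` is a hypothetical helper that returns plain dictionaries rather than NeoFox model objects.

```python
import csv
import io

# Illustrative sample; the allele lists are shortened to entries visible in the table above
sample = (
    "identifier\tmhcIAlleles\tmhcIIAlleles\tisRnaAvailable\ttumorType\n"
    "Ptx 27\tHLA-A*03:01,HLA-A*29:02,HLA-B*07:02\tHLA-DRB1*04:02,HLA-DRB1*08:01\tTrue\tHNSC\n"
)


def read_patients(handle):
    """Read a tab-separated patients table into a list of dictionaries."""
    patients = []
    for row in csv.DictReader(handle, delimiter="\t"):
        patients.append({
            "identifier": row["identifier"],
            "mhc1_alleles": [a for a in row["mhcIAlleles"].split(",") if a],
            "mhc2_alleles": [a for a in row["mhcIIAlleles"].split(",") if a],
            "is_rna_available": row["isRnaAvailable"].strip().lower() == "true",
            "tumor_type": row["tumorType"],
        })
    return patients


parsed_patients = read_patients(io.StringIO(sample))
print(parsed_patients[0])
```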

### Annotate your neoantigens

Running NeoFox requires its configuration through a number of environment variables; this is described in detail elsewhere in the documentation.


In [33]:
from neofox.neofox import NeoFox
import os
os.environ["NEOFOX_REFERENCE_FOLDER"] = "/home/priesgo/neofox_install/reference_data_9/"
os.environ["NEOFOX_RSCRIPT"] = "/usr/bin/Rscript"
os.environ["NEOFOX_BLASTP"] = "/home/priesgo/neofox_install/ncbi-blast-2.10.1+/bin/blastp"
os.environ["NEOFOX_NETMHCPAN"] = "/home/priesgo/neofox_install/netMHCpan-4.0/netMHCpan"
os.environ["NEOFOX_NETMHC2PAN"] = "/home/priesgo/neofox_install/netMHCIIpan-3.2/netMHCIIpan"
os.environ["NEOFOX_MIXMHCPRED"] = "/home/priesgo/neofox_install/MixMHCpred-2.1/MixMHCpred"
os.environ["NEOFOX_MIXMHC2PRED"] = "/home/priesgo/neofox_install/MixMHC2pred-1.2/MixMHC2pred_unix"
annotations = NeoFox(neoantigens=neoantigens, patients=patients, num_cpus=4).get_annotations()

[I 201118 11:23:32 neofox:85] Loading data...
[I 201118 11:23:32 references:147] Reference genome folder: /home/priesgo/neofox_install/reference_data_9/
[I 201118 11:23:32 references:148] Resources
[I 201118 11:23:32 references:150] /home/priesgo/neofox_install/reference_data_9/netmhc2pan_available_alleles.txt
[I 201118 11:23:32 references:150] /home/priesgo/neofox_install/reference_data_9/netmhcpan_available_alleles.txt
[I 201118 11:23:32 references:150] /home/priesgo/neofox_install/reference_data_9/iedb
[I 201118 11:23:32 references:150] /home/priesgo/neofox_install/reference_data_9/proteome_db
[I 201118 11:23:32 references:150] /home/priesgo/neofox_install/reference_data_9/proteome_db/Homo_sapiens.fa
[I 201118 11:23:32 references:150] /home/priesgo/neofox_install/reference_data_9/iedb/IEDB.fasta
[I 201118 11:23:32 references:150] /home/priesgo/neofox_install/reference_data_9/iedb/iedb_blast_db.phr
[I 201118 11:23:32 references:150] /home/priesgo/neofox_install/reference_data_9/iedb/
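
Since the configuration is read from the environment, it can help to check that the variables are set before starting a long run. The sketch below lists the variable names used in the cell above. Treating all seven as required is an assumption, and `missing_variables` is a hypothetical helper, not part of NeoFox.

```python
import os

# Variable names taken from the example above; which ones are mandatory is assumed
REQUIRED_VARIABLES = [
    "NEOFOX_REFERENCE_FOLDER",
    "NEOFOX_RSCRIPT",
    "NEOFOX_BLASTP",
    "NEOFOX_NETMHCPAN",
    "NEOFOX_NETMHC2PAN",
    "NEOFOX_MIXMHCPRED",
    "NEOFOX_MIXMHC2PRED",
]


def missing_variables(environment=None):
    """Return the names of the variables that are unset or empty."""
    environment = os.environ if environment is None else environment
    return [name for name in REQUIRED_VARIABLES if not environment.get(name)]


missing = missing_variables()
if missing:
    print("Missing configuration: {}".format(", ".join(missing)))
else:
    print("All NeoFox environment variables are set")
```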

### Transform the annotations into a data frame

In [34]:
annotations_ts = ModelConverter.annotations2tall_skinny_table(neoantigen_annotations=annotations)
annotations_ts.head(10)

Unnamed: 0,name,value,neoantigen_identifier
0,Expression_mutated_transcript,0,ZHF7I5wQrRBOFIiwJUz8cA==
1,mutation_not_found_in_proteome,1,ZHF7I5wQrRBOFIiwJUz8cA==
2,Best_rank_MHCI_score,6.2429,ZHF7I5wQrRBOFIiwJUz8cA==
3,Best_rank_MHCI_score_epitope,GEPSQDILVT,ZHF7I5wQrRBOFIiwJUz8cA==
4,Best_rank_MHCI_score_allele,HLA-B*44:03,ZHF7I5wQrRBOFIiwJUz8cA==
5,Best_affinity_MHCI_score,3984.4,ZHF7I5wQrRBOFIiwJUz8cA==
6,Best_affinity_MHCI_epitope,ILVTDQTRL,ZHF7I5wQrRBOFIiwJUz8cA==
7,Best_affinity_MHCI_allele,HLA-C*16:01,ZHF7I5wQrRBOFIiwJUz8cA==
8,Best_rank_MHCI_9mer_score,6.2525,ZHF7I5wQrRBOFIiwJUz8cA==
9,Best_rank_MHCI_9mer_epitope,ILVTDQTRL,ZHF7I5wQrRBOFIiwJUz8cA==


In [35]:
annotations_sw = ModelConverter.annotations2short_wide_table(annotations, neoantigens)
annotations_sw

Unnamed: 0,identifier,clonalityEstimation,dnaVariantAlleleFrequency,mutation.leftFlankingRegion,mutation.mutatedAminoacid,mutation.mutatedXmer,mutation.position,mutation.rightFlankingRegion,mutation.sizeLeftFlankingRegion,mutation.sizeRightFlankingRegion,...,MixMHCpred_best_peptide,MixMHCpred_best_score,MixMHCpred_best_rank,MixMHCpred_best_allele,MixMHC2pred_best_peptide,MixMHC2pred_best_rank,MixMHC2pred_best_allele,Dissimilarity_MHCI_cutoff500nM,vaxrank_binding_score,vaxrank_total_score
0,ZHF7I5wQrRBOFIiwJUz8cA==,False,0.0,DEVLGEPSQDILV,T,DEVLGEPSQDILVTDQTRLEATISPET,1007,DQTRLEATISPET,13,13,...,VTDQTRLEA,-0.09792,10,A2902,DEVLGEPSQDILVT,3.06,DPA1_01_03__DPB1_04_01,,0.0,0
1,LFJVKawWbJbtTHk4tYfLDQ==,False,0.0,RTNLLAALHRSVR,W,RTNLLAALHRSVRWRAADQGHRSAFLV,564,RAADQGHRSAFLV,13,13,...,AALHRSVRW,0.17623,2,A2902,TNLLAALHRSVRWR,2.09,DRB1_08_01,,0.11069,0
2,QehAoQpFLF+d0yrDBLAioA==,False,0.0,MTEYKLVVVGA,C,MTEYKLVVVGACGVGKSALTIQLIQ,12,GVGKSALTIQLIQ,11,13,...,VVGACGVGK,0.22889,2,A0301,TEYKLVVVGACGVG,2.01,DRB1_08_01,0.0,1.6547,0
3,nEnsNGVjkWN5jVsOL8Spew==,False,0.0,QTDSSSSDMQACS,K,QTDSSSSDMQACSKDKAKISLGSSIDS,968,DKAKISLGSSIDS,13,13,...,ACSKDKAKISL,0.11544,3,B0702,SSSDMQACSKDKAKIS,2.38,DRB1_08_01,,0.14174,0
4,Hdoq3Q41wRHGYqIXx9UvDg==,False,0.0,DRAIPLVLVSGNH,Y,DRAIPLVLVSGNHYIGNTPTAETVEEF,129,IGNTPTAETVEEF,13,13,...,LVLVSGNHY,0.19414,2,A2902,GNHYIGNTPTAETVEE,5.59,DPA1_01_03__DPB1_04_01,0.0,0.89423,0
5,e5ukn5aW6/8Rm4ow0yAmGw==,False,0.0,YNKAVYISVQDKE,E,YNKAVYISVQDKEEEKGVNNGGEKRAD,167,EKGVNNGGEKRAD,13,13,...,SVQDKEEEK,0.03037,5,A0301,KAVYISVQDKEEEK,1.17,DRB1_08_01,,0.0,0
6,ljeFHtCIS2FkX2qBl3kWkA==,False,0.0,ASTHLTVIGTSPH,V,ASTHLTVIGTSPHVPGSVRVQVSMTTA,512,PGSVRVQVSMTTA,13,13,...,VPGSVRVQV,0.23011,2,B0702,ASTHLTVIGTSPHVPG,9.56,DRB1_08_01,0.0,3.7569,0
7,Tis/Ry2+2i8/KNt3hMAlrg==,False,0.0,TRRDEKSHPFTNP,Q,TRRDEKSHPFTNPQWATRVFAAECVCR,1222,WATRVFAAECVCR,13,13,...,NPQWATRVF,0.21407,2,B0702,TNPQWATRVFAAE,3.56,DRB1_08_01,0.0,2.5678,0
8,zczjOnV4nzE+3u+rGKg2zQ==,False,0.0,ARPDMFCLFHGKR,H,ARPDMFCLFHGKRHFPGESWHPYLEPQ,40,FPGESWHPYLEPQ,13,13,...,CLFHGKRHF,0.18702,2,A2902,KRHFPGESWHPYLE,2.8,DPA1_01_03__DPB1_04_01,0.0,0.92031,0
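
The two tables hold the same annotations in different shapes: the tall-skinny table has one row per annotation and neoantigen, while the short-wide table has one row per neoantigen and one column per annotation. The sketch below shows that reshaping with plain dictionaries, using two rows from the tall-skinny output above. It is an illustration under those assumptions, not the code behind `annotations2short_wide_table`, and `tall_skinny_to_short_wide` is a hypothetical name.

```python
from collections import defaultdict


def tall_skinny_to_short_wide(rows):
    """Pivot (name, value, neoantigen_identifier) rows into one record per neoantigen."""
    wide = defaultdict(dict)
    for row in rows:
        wide[row["neoantigen_identifier"]][row["name"]] = row["value"]
    return [
        dict(identifier=identifier, **annotations)
        for identifier, annotations in wide.items()
    ]


rows = [
    {"name": "Best_rank_MHCI_score", "value": "6.2429",
     "neoantigen_identifier": "ZHF7I5wQrRBOFIiwJUz8cA=="},
    {"name": "Best_rank_MHCI_score_allele", "value": "HLA-B*44:03",
     "neoantigen_identifier": "ZHF7I5wQrRBOFIiwJUz8cA=="},
]
print(tall_skinny_to_short_wide(rows))
```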
