# Programmatic usage of NeoFox

NeoFox provides an Application Programming Interface (API) that enables its integration into other applications. This API relies heavily on Protocol Buffers data models, which act as containers for the required data and support different representations, data manipulation, validation and normalization. We use the Protocol Buffers data models to generate Python code automatically and implement validation and normalization around it. Because Protocol Buffers is technology agnostic, this may also facilitate integration with third-party applications that are not implemented in Python (see https://developers.google.com/protocol-buffers). The API is tightly integrated with the Python data analysis library Pandas (see https://pandas.pydata.org/).

Here we show: 

* how to create new model objects
* how to import/export these objects into different representations
* how to manipulate them
* how to validate and normalize the data

Finally, we show how to run NeoFox programmatically. You may want to skip to that part for a quick grasp of the API usage.

## Neoantigens

The neoantigen is the central piece of information that NeoFox handles; all output annotations refer to a neoantigen. A neoantigen is formed by a mutation subentity plus some additional attributes such as the gene and the patient identifier. Here we show how to create a neoantigen, transform it into different representations and validate it.

### Create a neoantigen


In [1]:
from neofox.model.neoantigen import Neoantigen, Mutation

Create a mutation:

In [2]:
mutation = Mutation(
    wild_type_xmer="DEVLGEPSQDILVIDQTRLEATISPET", 
    mutated_xmer="DEVLGEPSQDILVTDQTRLEATISPET")

Create a neoantigen using the previous mutation:

In [3]:
neoantigen = Neoantigen(
    patient_identifier="P123", 
    mutation=mutation, 
    gene="VCAN", 
    rna_expression=0.519506894, 
    rna_variant_allele_frequency=0.857142857, 
    dna_variant_allele_frequency=0.294573643)

### Representation into different formats

Data that conforms to the NeoFox data models can be represented in different formats. Here we show how to transform the data between several of them: JSON, Python dictionaries, Protocol Buffers binary representations, Pandas dataframes and tabular files. This enables data import and export and adds flexibility to the integration with other tools.

What is shown here is applicable to all entities in NeoFox data models.

These objects can be easily transformed into JSON:

In [4]:
print(neoantigen.to_json(indent=2))

{
  "patientIdentifier": "P123",
  "gene": "VCAN",
  "mutation": {
    "wildTypeXmer": "DEVLGEPSQDILVIDQTRLEATISPET",
    "mutatedXmer": "DEVLGEPSQDILVTDQTRLEATISPET"
  },
  "rnaExpression": 0.519506894,
  "dnaVariantAlleleFrequency": 0.294573643,
  "rnaVariantAlleleFrequency": 0.857142857
}


They can also be transformed into a Python native dictionary:

In [5]:
neoantigen.to_dict()

{'patientIdentifier': 'P123',
 'gene': 'VCAN',
 'mutation': {'wildTypeXmer': 'DEVLGEPSQDILVIDQTRLEATISPET',
  'mutatedXmer': 'DEVLGEPSQDILVTDQTRLEATISPET'},
 'rnaExpression': 0.519506894,
 'dnaVariantAlleleFrequency': 0.294573643,
 'rnaVariantAlleleFrequency': 0.857142857}
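
Note that the JSON and dictionary representations above use camelCase keys (`patientIdentifier`), while the Python attributes use snake_case (`patient_identifier`). When exchanging dictionaries with other tools it can help to convert between both conventions. The following is a minimal sketch using only the standard library; `camel_to_snake` and `snake_keys` are hypothetical helpers and not part of the NeoFox API.

```python
import re


def camel_to_snake(name):
    """Convert a camelCase field name into snake_case."""
    # insert an underscore before every capital letter that is not at the start
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()


def snake_keys(data):
    """Recursively rename the keys of nested dictionaries and lists."""
    if isinstance(data, dict):
        return {camel_to_snake(key): snake_keys(value) for key, value in data.items()}
    if isinstance(data, list):
        return [snake_keys(item) for item in data]
    return data


record = {
    "patientIdentifier": "P123",
    "mutation": {"wildTypeXmer": "DEVLGEPSQDILVIDQTRLEATISPET"},
}
converted = snake_keys(record)
print(converted)
```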

They can also be serialized into the Protocol Buffers binary format, which is more compact for storing the data or sending it over the wire:

In [6]:
neoantigen.SerializeToString()

b'\x12\x04P123\x1a\x04VCAN":\x12\x1bDEVLGEPSQDILVIDQTRLEATISPET\x1a\x1bDEVLGEPSQDILVTDQTRLEATISPET-g\xfe\x04?5[\xd2\x96>=\xb7m[?'

### Integration with Pandas

NeoFox integrates with Pandas, the Python library for data analysis (see https://pandas.pydata.org/). A single object can be transformed into a Pandas `Series` and a list of objects can be transformed into a Pandas `DataFrame`. Pandas provides functionality to persist these tabular representations to files that can be stored and imported into other environments, for instance R.

What is shown here is applicable to all entities in NeoFox data models.

In [7]:
import pandas as pd
from neofox.model.conversion import ModelConverter

Transform a neoantigen into a Pandas `Series`:

In [8]:
neoantigen_series = ModelConverter.object2series(neoantigen)
neoantigen_series

dna_variant_allele_frequency                       0.294574
gene                                                   VCAN
identifier                                                 
mutation.mutated_xmer           DEVLGEPSQDILVTDQTRLEATISPET
mutation.position                                        []
mutation.wild_type_xmer         DEVLGEPSQDILVIDQTRLEATISPET
patient_identifier                                     P123
rna_expression                                     0.519507
rna_variant_allele_frequency                       0.857143
Name: 0, dtype: object

Transform a list of mutations into a Pandas `DataFrame`:

In [9]:
mutation2 = Mutation(
    wild_type_xmer="AAAAAAAAAAAAAAAAAAAAAAAAAAA", 
    mutated_xmer="AAAAAAAAAAAAAGAAAAAAAAAAAAA")
mutations_df = ModelConverter.objects2dataframe([mutation, mutation2])
mutations_df

Unnamed: 0,mutatedXmer,position,wildTypeXmer
0,DEVLGEPSQDILVTDQTRLEATISPET,[],DEVLGEPSQDILVIDQTRLEATISPET
1,AAAAAAAAAAAAAGAAAAAAAAAAAAA,[],AAAAAAAAAAAAAAAAAAAAAAAAAAA


Persist any Pandas object into a file:

In [10]:
mutations_df.to_csv("my_mutations.csv", sep="\t", index=False)

And read it back:

In [11]:
mutations_df2 = pd.read_csv("my_mutations.csv", sep="\t")
mutations = []
for _, row in mutations_df2.iterrows():
    # Mutation has no nested fields, so each row maps directly onto the model
    mutations.append(Mutation().from_dict(row.to_dict()))
mutations

[Mutation(position='[]', wild_type_xmer='DEVLGEPSQDILVIDQTRLEATISPET', mutated_xmer='DEVLGEPSQDILVTDQTRLEATISPET'),
 Mutation(position='[]', wild_type_xmer='AAAAAAAAAAAAAAAAAAAAAAAAAAA', mutated_xmer='AAAAAAAAAAAAAGAAAAAAAAAAAAA')]
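
Note that `position` comes back as the string `'[]'` rather than an empty list: a delimited text file stores every cell as text, so list-valued fields lose their type on the way through the file. The following is a minimal sketch of that effect and of one possible way to restore the list, using only the standard library and an in-memory buffer instead of a file; it does not describe how NeoFox itself handles this.

```python
import ast
import csv
import io

rows = [{"position": [], "wild_type_xmer": "DEVLGEPSQDILVIDQTRLEATISPET"}]

# write a tab-separated table into an in-memory buffer
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["position", "wild_type_xmer"], delimiter="\t")
writer.writeheader()
writer.writerows(rows)

# reading the table back yields a string for every cell
buffer.seek(0)
raw = next(csv.DictReader(buffer, delimiter="\t"))
print(repr(raw["position"]))

# ast.literal_eval safely parses the textual list back into a Python list
restored = dict(raw, position=ast.literal_eval(raw["position"]))
print(restored["position"])
```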

In some cases you may be handling nested objects, for instance a neoantigen. The nesting is flattened into the DataFrame by concatenating field names with a dot, e.g. `mutation.wild_type_xmer`. To read the flattened data back into the nested models we need an intermediate step.

In [12]:
# the flattened dictionary
neoantigen_series.to_dict()

{'dna_variant_allele_frequency': 0.294573643,
 'gene': 'VCAN',
 'identifier': '',
 'mutation.mutated_xmer': 'DEVLGEPSQDILVTDQTRLEATISPET',
 'mutation.position': [],
 'mutation.wild_type_xmer': 'DEVLGEPSQDILVIDQTRLEATISPET',
 'patient_identifier': 'P123',
 'rna_expression': 0.519506894,
 'rna_variant_allele_frequency': 0.857142857}

In [13]:
# the nested dictionary
ModelConverter._flat_dict2nested_dict(flat_dict=neoantigen_series.to_dict())

{'dna_variant_allele_frequency': 0.294573643,
 'gene': 'VCAN',
 'identifier': '',
 'mutation': {'mutated_xmer': 'DEVLGEPSQDILVTDQTRLEATISPET',
  'position': [],
  'wild_type_xmer': 'DEVLGEPSQDILVIDQTRLEATISPET'},
 'patient_identifier': 'P123',
 'rna_expression': 0.519506894,
 'rna_variant_allele_frequency': 0.857142857}

In [14]:
# we can load the nested dictionary into a nested model object
Neoantigen().from_dict(ModelConverter._flat_dict2nested_dict(flat_dict=neoantigen_series.to_dict()))

Neoantigen(identifier='', patient_identifier='P123', gene='VCAN', mutation=Mutation(position=[], wild_type_xmer='DEVLGEPSQDILVIDQTRLEATISPET', mutated_xmer='DEVLGEPSQDILVTDQTRLEATISPET'), rna_expression=0.519506894, dna_variant_allele_frequency=0.294573643, rna_variant_allele_frequency=0.857142857)
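
To illustrate the idea behind the flattening step, here is a minimal, standard-library sketch of converting between dot-separated keys and nested dictionaries. `flat_to_nested` and `nested_to_flat` are hypothetical helpers written for this example; they are not the NeoFox implementation and may differ from it in edge cases.

```python
def flat_to_nested(flat, separator="."):
    """Build nested dictionaries from keys such as 'mutation.position'."""
    nested = {}
    for key, value in flat.items():
        *parents, leaf = key.split(separator)
        node = nested
        for parent in parents:
            # create the intermediate dictionary on first use
            node = node.setdefault(parent, {})
        node[leaf] = value
    return nested


def nested_to_flat(nested, prefix="", separator="."):
    """Inverse operation: join the path to every leaf with the separator."""
    flat = {}
    for key, value in nested.items():
        path = prefix + separator + key if prefix else key
        if isinstance(value, dict):
            flat.update(nested_to_flat(value, path, separator))
        else:
            flat[path] = value
    return flat


flat = {
    "gene": "VCAN",
    "mutation.mutated_xmer": "DEVLGEPSQDILVTDQTRLEATISPET",
    "mutation.wild_type_xmer": "DEVLGEPSQDILVIDQTRLEATISPET",
}
nested = flat_to_nested(flat)
print(nested)
```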

### Data validation

The quality and cleanliness of data are of great importance to enable effective data analysis and to make the data machine readable. Clean data is valid and is expressed in a normal, homogeneous form. The use of controlled vocabularies helps to represent knowledge in a standardised way. Cleaning data is a domain-specific task: it can be assisted by the right tools, such as Pandas in Python or tidyverse in R, but it requires domain expertise. NeoFox provides this domain expertise out of the box with its validation and normalization layers on top of its data models.

In [15]:
from neofox.model.conversion import ModelValidator
from neofox.exceptions import NeofoxDataValidationException

The data validation checks for missing required fields and shows relevant messages.

In [16]:
try:
    ModelValidator.validate_neoantigen(neoantigen=Neoantigen())
except NeofoxDataValidationException as e:
    print("Error message: {}".format(e))

Error message: Missing patient identifier on neoantigen


It also performs more domain-specific validations, such as checking that amino acids are valid according to the IUPAC standard amino acid representation.

In [17]:
try:
    ModelValidator.validate_neoantigen(neoantigen=Neoantigen(
        patient_identifier="12345",
        gene="VCAN",
        mutation=Mutation(
            wild_type_xmer="123456AAAAAAAAAAAAAA", # wrong aminoacid representation
            mutated_xmer="123456GAAAAAAAAAAAAA")))
except NeofoxDataValidationException as e:
    print("Error message: {}".format(e))

Error message: Non existing aminoacid 1
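
As an illustration of this kind of check, the sketch below validates a sequence against the 20 standard one-letter IUPAC amino acid codes. It is a simplified stand-in and not the NeoFox validator: `validate_sequence` is a hypothetical function, it raises a plain `ValueError`, and whether NeoFox accepts additional codes (for instance ambiguity codes) is not covered here.

```python
# the 20 standard amino acids in one-letter IUPAC code
IUPAC_AMINO_ACIDS = frozenset("ACDEFGHIKLMNPQRSTVWY")


def validate_sequence(sequence):
    """Return the sequence unchanged or raise on the first invalid residue."""
    for residue in sequence:
        if residue.upper() not in IUPAC_AMINO_ACIDS:
            raise ValueError("Non existing aminoacid {}".format(residue))
    return sequence


try:
    validate_sequence("123456AAAAAAAAAAAAAA")
    message = None
except ValueError as error:
    message = str(error)
print(message)
```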


The data normalization layer ensures that the amino acid representation is normalized into one-letter IUPAC codes. It also fills in the position of the mutation within the sequence.

In [18]:
valid_neoantigen = ModelValidator.validate_neoantigen(neoantigen=Neoantigen(
    patient_identifier="12345",
    mutation=Mutation(
        wild_type_xmer="AAAAAAAAAAAAA",
        mutated_xmer="aaaaaGaaaaa")))

print(valid_neoantigen.mutation.to_json(indent=2))

{
  "position": [
    6
  ],
  "wildTypeXmer": "AAAAAAAAAAAAA",
  "mutatedXmer": "AAAAAGAAAAA"
}
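
The normalized output above can be reproduced with a small sketch: upper-case both sequences and report the 1-based positions at which they differ. This is an assumption about the general approach rather than the NeoFox implementation; in particular, `mutated_positions` is a hypothetical helper that only compares the sequences up to the length of the shorter one.

```python
def normalize_sequence(sequence):
    """Strip surrounding whitespace and use upper-case one-letter codes."""
    return sequence.strip().upper()


def mutated_positions(wild_type_xmer, mutated_xmer):
    """Return the 1-based positions where the two sequences differ."""
    wild_type = normalize_sequence(wild_type_xmer)
    mutated = normalize_sequence(mutated_xmer)
    # zip stops at the end of the shorter sequence
    return [
        position
        for position, (before, after) in enumerate(zip(wild_type, mutated), start=1)
        if before != after
    ]


print(mutated_positions("AAAAAAAAAAAAA", "aaaaaGaaaaa"))
print(mutated_positions("DEVLGEPSQDILVIDQTRLEATISPET", "DEVLGEPSQDILVTDQTRLEATISPET"))
```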


After validation a unique neoantigen identifier is generated. This identifier is a hash of the normalized neoantigen representation; thus two different representations of the same neoantigen will share the same identifier after normalization.

In [19]:
validated_neoantigen = ModelValidator.validate_neoantigen(neoantigen=neoantigen)
print(validated_neoantigen.to_json(indent=2))

{
  "identifier": "v9cO6QEVw9XjrpJlvJ3isw==",
  "patientIdentifier": "P123",
  "gene": "VCAN",
  "mutation": {
    "position": [
      14
    ],
    "wildTypeXmer": "DEVLGEPSQDILVIDQTRLEATISPET",
    "mutatedXmer": "DEVLGEPSQDILVTDQTRLEATISPET"
  },
  "rnaExpression": 0.519506894,
  "dnaVariantAlleleFrequency": 0.294573643,
  "rnaVariantAlleleFrequency": 0.857142857
}
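
The sketch below shows why hashing the normalized representation gives equal identifiers for equivalent inputs. Everything in it is an assumption made for illustration: the hypothetical `neoantigen_identifier` function, the choice of fields that enter the hash, and the hash function itself (BLAKE2b with a 16-byte digest, Base64 encoded). The NeoFox scheme is not documented here and will produce different values.

```python
import base64
import hashlib
import json


def neoantigen_identifier(patient_identifier, wild_type_xmer, mutated_xmer):
    """Hash a canonical form of the (hypothetical) identifying fields."""
    canonical = json.dumps(
        {
            "patient": patient_identifier.strip(),
            "wild_type": wild_type_xmer.strip().upper(),
            "mutated": mutated_xmer.strip().upper(),
        },
        sort_keys=True,  # a fixed key order keeps the hash input stable
    )
    digest = hashlib.blake2b(canonical.encode("utf-8"), digest_size=16).digest()
    return base64.b64encode(digest).decode("ascii")


upper = neoantigen_identifier("P123", "AAAAAAAAAAA", "AAAAAGAAAAA")
lower = neoantigen_identifier("P123", "aaaaaaaaaaa", "aaaaagaaaaa")
other = neoantigen_identifier("P123", "AAAAAAAAAAA", "AAAAACAAAAA")
print(upper == lower, upper == other)
```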


## Patients

The neoantigen annotation process needs some context information, in particular some data about the individual in whom the somatic mutation creating this neoantigen took place. This information mainly includes the HLA types of the patient, which are needed to compute the binding of the potential neoepitopes.

### Parse MHC I alleles into a normal representation

The main complexity in the patient model is the representation of the MHC I and MHC II alleles present in the patient. The HLA alleles are typically represented using the nomenclature defined at http://hla.alleles.org, but in practice the community represents HLA alleles with some flexibility. NeoFox aims to normalize the different HLA representations into a controlled representation that agrees with the HLA nomenclature. NeoFox only supports the classic MHC genes and, although the provided HLA type is kept internally, it only works with the first 4 digits.

There are specific functions in NeoFox to parse a list of non-normalized HLA alleles into a normalized representation. Furthermore, the zygosity of each HLA gene is inferred.

Parse a list of MHC I alleles. The data validation will ensure that the data is valid and it will infer the zygosity of the different genes. The data normalization layer will normalize the HLA representation into the valid HLA nomenclature including the first 4 digits.

In [20]:
mhc1 = ModelConverter.parse_mhc1_alleles(["HLA-A*01:01:02:03N", "HLA-A*01:02:02:03N", 
                                          "HLA-B*01:01:02:03N", "HLA-B*01:01:02:04N", 
                                          "HLA-C*01:01"])
ModelConverter.objects2dataframe(mhc1)

Unnamed: 0,alleles,name,zygosity
0,"[{'fullName': 'HLA-A*01:01:02:03N', 'name': 'H...",A,HETEROZYGOUS
1,"[{'fullName': 'HLA-B*01:01:02:03N', 'name': 'H...",B,HOMOZYGOUS
2,"[{'fullName': 'HLA-C*01:01', 'name': 'HLA-C*01...",C,HEMIZYGOUS


In [21]:
ModelConverter.objects2dataframe(mhc1[0].alleles + mhc1[1].alleles + mhc1[2].alleles)

Unnamed: 0,fullName,gene,group,name,protein
0,HLA-A*01:01:02:03N,A,1,HLA-A*01:01,1
1,HLA-A*01:02:02:03N,A,1,HLA-A*01:02,2
2,HLA-B*01:01:02:03N,B,1,HLA-B*01:01,1
3,HLA-C*01:01,C,1,HLA-C*01:01,1
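
The zygosity values in the tables above follow from counting the distinct 4-digit alleles per gene. A minimal sketch of that rule is shown below; it assumes allele names already normalized to the form `HLA-A*01:01`, and `infer_zygosity` is a hypothetical helper rather than the NeoFox code.

```python
from collections import defaultdict


def infer_zygosity(allele_names, genes=("A", "B", "C")):
    """Map each gene to a zygosity label based on its normalized alleles."""
    alleles_by_gene = defaultdict(list)
    for name in allele_names:
        # 'HLA-A*01:01' -> gene 'A'
        gene = name[len("HLA-"):name.index("*")]
        alleles_by_gene[gene].append(name)

    zygosity = {}
    for gene in genes:
        alleles = alleles_by_gene[gene]
        if len(alleles) > 2:
            raise ValueError("More than 2 alleles for gene {}".format(gene))
        if not alleles:
            zygosity[gene] = "LOSS"
        elif len(alleles) == 1:
            zygosity[gene] = "HEMIZYGOUS"
        elif alleles[0] == alleles[1]:
            zygosity[gene] = "HOMOZYGOUS"
        else:
            zygosity[gene] = "HETEROZYGOUS"
    return zygosity


result = infer_zygosity(
    ["HLA-A*01:01", "HLA-A*01:02", "HLA-B*01:01", "HLA-B*01:01", "HLA-C*01:01"]
)
print(result)
```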


### Validation and normalization of MHC alleles

The data validation layer checks that the provided allele representations are valid.

In [22]:
try:
    ModelConverter.parse_mhc1_alleles(["HLA-W*01:01:02:03N"])  # bad gene W
except NeofoxDataValidationException as e:
    print("Error message: {}".format(e))

Error message: Allele does not match HLA allele pattern HLA-W*01:01:02:03N


In [23]:
try:
    ModelConverter.parse_mhc1_alleles(["HLA-A*first:second:02:03N"])  # bad allele representation
except NeofoxDataValidationException as e:
    print("Error message: {}".format(e))

Error message: Allele does not match HLA allele pattern HLA-A*first:second:02:03N


In [24]:
try:
    ModelConverter.parse_mhc1_alleles(["HLA-A*01:02:02:03N", "HLA-A*01:03:02:03N", "HLA-A*01:04:02:03N"])  # wrong number of alleles
except NeofoxDataValidationException as e:
    print("Error message: {}".format(e))

Error message: More than 2 alleles for gene A


The data normalization layer ensures that different representations of the same HLA allele are equivalent in the internal representation.

In [25]:
ModelConverter.parse_mhc1_alleles(["HLA-A*01:02:02:03N", "HLA-A*01:02", "HLA-B*01:01", "HLA-B01_01"])

[Mhc1(name=<Mhc1Name.A: 0>, zygosity=<Zygosity.HOMOZYGOUS: 0>, alleles=[MhcAllele(full_name='HLA-A*01:02:02:03N', name='HLA-A*01:02', gene='A', group='01', protein='02')]),
 Mhc1(name=<Mhc1Name.B: 1>, zygosity=<Zygosity.HOMOZYGOUS: 0>, alleles=[MhcAllele(full_name='HLA-B*01:01', name='HLA-B*01:01', gene='B', group='01', protein='01')]),
 Mhc1(name=<Mhc1Name.C: 2>, zygosity=<Zygosity.LOSS: 3>, alleles=[])]
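
A regular expression is one way to reduce such representations to the 4-digit form. The sketch below is an assumption about the approach, not the pattern NeoFox uses: `ALLELE_PATTERN` and `normalize_mhc1_allele` are hypothetical, only the classic genes A, B and C are accepted, and a plain `ValueError` is raised instead of the NeoFox exception.

```python
import re

# gene, optional '*', allele group, optional ':' or '_', protein;
# any further fields (e.g. ':02:03N') are ignored
ALLELE_PATTERN = re.compile(r"HLA-([ABC])\*?(\d{2,3})[:_]?(\d{2,3})")


def normalize_mhc1_allele(allele):
    """Return the allele in the form HLA-<gene>*<group>:<protein>."""
    match = ALLELE_PATTERN.match(allele)
    if match is None:
        raise ValueError("Allele does not match HLA allele pattern {}".format(allele))
    gene, group, protein = match.groups()
    return "HLA-{}*{}:{}".format(gene, group, protein)


normalized = [
    normalize_mhc1_allele(allele)
    for allele in ["HLA-A*01:02:02:03N", "HLA-A*01:02", "HLA-B*01:01", "HLA-B01_01"]
]
print(normalized)
```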

### Parse MHC II alleles into a normal representation

The model for MHC II alleles is more complex as we need to reflect all combinations of alpha and beta chains, but the data validation and normalization provided by NeoFox is fundamentally the same.

Parse a list of MHC II alleles:

In [26]:
mhc2 = ModelConverter.parse_mhc2_alleles(["HLA-DPA1*01:01", "HLA-DPA1*01:02", "HLA-DPB1*01:01", "HLA-DPB1*01:01", 
                                          "HLA-DQA1*01:01", "HLA-DQA1*01:01", "HLA-DQB1*01:01", "HLA-DQB1*01:01", 
                                          "HLA-DRB1*01:01", "HLA-DRB1*01:01"])

An MHC II gene with a heterozygous alpha chain and a homozygous beta chain has two isoforms.

In [27]:
mhc2[1].to_dict()

{'name': 'DP',
 'genes': [{'name': 'DPA1',
   'zygosity': 'HETEROZYGOUS',
   'alleles': [{'fullName': 'HLA-DPA1*01:01',
     'name': 'HLA-DPA1*01:01',
     'gene': 'DPA1',
     'group': '01',
     'protein': '01'},
    {'fullName': 'HLA-DPA1*01:02',
     'name': 'HLA-DPA1*01:02',
     'gene': 'DPA1',
     'group': '01',
     'protein': '02'}]},
  {'name': 'DPB1',
   'alleles': [{'fullName': 'HLA-DPB1*01:01',
     'name': 'HLA-DPB1*01:01',
     'gene': 'DPB1',
     'group': '01',
     'protein': '01'}]}],
 'isoforms': [{'name': 'HLA-DPA1*01:01-DPB1*01:01',
   'alphaChain': {'fullName': 'HLA-DPA1*01:01',
    'name': 'HLA-DPA1*01:01',
    'gene': 'DPA1',
    'group': '01',
    'protein': '01'},
   'betaChain': {'fullName': 'HLA-DPB1*01:01',
    'name': 'HLA-DPB1*01:01',
    'gene': 'DPB1',
    'group': '01',
    'protein': '01'}},
  {'name': 'HLA-DPA1*01:02-DPB1*01:01',
   'alphaChain': {'fullName': 'HLA-DPA1*01:02',
    'name': 'HLA-DPA1*01:02',
    'gene': 'DPA1',
    'group': '01',
   

An MHC II gene with homozygous alpha and beta chains has a single isoform.

In [28]:
mhc2[2].to_dict()

{'name': 'DQ',
 'genes': [{'name': 'DQA1',
   'alleles': [{'fullName': 'HLA-DQA1*01:01',
     'name': 'HLA-DQA1*01:01',
     'gene': 'DQA1',
     'group': '01',
     'protein': '01'}]},
  {'name': 'DQB1',
   'alleles': [{'fullName': 'HLA-DQB1*01:01',
     'name': 'HLA-DQB1*01:01',
     'gene': 'DQB1',
     'group': '01',
     'protein': '01'}]}],
 'isoforms': [{'name': 'HLA-DQA1*01:01-DQB1*01:01',
   'alphaChain': {'fullName': 'HLA-DQA1*01:01',
    'name': 'HLA-DQA1*01:01',
    'gene': 'DQA1',
    'group': '01',
    'protein': '01'},
   'betaChain': {'fullName': 'HLA-DQB1*01:01',
    'name': 'HLA-DQB1*01:01',
    'gene': 'DQB1',
    'group': '01',
    'protein': '01'}}]}

The MHC II DRB gene is a special case: the alpha chain is not represented because it is not variable.

In [29]:
mhc2[0].to_dict()

{'genes': [{'alleles': [{'fullName': 'HLA-DRB1*01:01',
     'name': 'HLA-DRB1*01:01',
     'gene': 'DRB1',
     'group': '01',
     'protein': '01'}]}],
 'isoforms': [{'name': 'HLA-DRB1*01:01',
   'betaChain': {'fullName': 'HLA-DRB1*01:01',
    'name': 'HLA-DRB1*01:01',
    'gene': 'DRB1',
    'group': '01',
    'protein': '01'}}]}

Beware that incomplete MHC II molecules missing one of the chains will have no isoforms and thus no binding will be computed on them. In the case below the beta chain allele for the DP gene is missing.

In [30]:
mhc2 = ModelConverter.parse_mhc2_alleles(["HLA-DPA1*01:01", "HLA-DPA1*01:02"])
mhc2[1].to_dict()

{'name': 'DP',
 'genes': [{'name': 'DPA1',
   'zygosity': 'HETEROZYGOUS',
   'alleles': [{'fullName': 'HLA-DPA1*01:01',
     'name': 'HLA-DPA1*01:01',
     'gene': 'DPA1',
     'group': '01',
     'protein': '01'},
    {'fullName': 'HLA-DPA1*01:02',
     'name': 'HLA-DPA1*01:02',
     'gene': 'DPA1',
     'group': '01',
     'protein': '02'}]},
  {'name': 'DPB1', 'zygosity': 'LOSS'}]}
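
The number of isoforms in the examples above corresponds to the combinations of distinct alpha and beta chain alleles. The following sketch reproduces those counts with `itertools.product`; `mhc2_isoforms` is a hypothetical helper, the name format is copied from the outputs above, and the DR special case (beta chain only) is deliberately left out.

```python
from itertools import product


def mhc2_isoforms(alpha_alleles, beta_alleles):
    """Combine every distinct alpha chain allele with every distinct beta chain allele."""
    alphas = sorted(set(alpha_alleles))
    betas = sorted(set(beta_alleles))
    # 'HLA-DPA1*01:01' + 'HLA-DPB1*01:01' -> 'HLA-DPA1*01:01-DPB1*01:01'
    return ["{}-{}".format(alpha, beta[len("HLA-"):]) for alpha, beta in product(alphas, betas)]


dp = mhc2_isoforms(["HLA-DPA1*01:01", "HLA-DPA1*01:02"], ["HLA-DPB1*01:01", "HLA-DPB1*01:01"])
dq = mhc2_isoforms(["HLA-DQA1*01:01", "HLA-DQA1*01:01"], ["HLA-DQB1*01:01", "HLA-DQB1*01:01"])
incomplete = mhc2_isoforms(["HLA-DPA1*01:01", "HLA-DPA1*01:02"], [])
print(dp, dq, incomplete)
```

With a missing beta chain the product is empty, which matches the absence of isoforms in the incomplete DP example.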

### Create a patient

In [31]:
from neofox.model.neoantigen import Patient


mhc1 = ModelConverter.parse_mhc1_alleles(["HLA-A*01:01:02:03N", "HLA-A*01:02:02:03N", 
                                          "HLA-B*01:01:02:03N", "HLA-B*01:01:02:04N", 
                                          "HLA-C*01:01"])
mhc2 = ModelConverter.parse_mhc2_alleles(["HLA-DPA1*01:01", "HLA-DPA1*01:02", "HLA-DPB1*01:01", "HLA-DPB1*01:01", 
                                          "HLA-DQA1*01:01", "HLA-DQA1*01:01", "HLA-DQB1*01:01", "HLA-DQB1*01:01", 
                                          "HLA-DRB1*01:01", "HLA-DRB1*01:01"])
patient = Patient(
    identifier="P123", 
    is_rna_available=True, 
    tumor_type="NSCLC", 
    mhc1=mhc1,
    mhc2=mhc2
)
ModelConverter.object2series(patient)

identifier                                                       P123
is_rna_available                                                 True
mhc1                [{'name': 'A', 'zygosity': 'HETEROZYGOUS', 'al...
mhc2                [{'name': 'DR', 'genes': [{'name': 'DRB1', 'zy...
tumor_type                                                      NSCLC
Name: 0, dtype: object

### Validate a patient

In [32]:
validated_patient = ModelValidator.validate_patient(patient)

A patient requires an identifier. MHC I and MHC II are optional: if one or the other is not available, the output annotations are adapted accordingly.

In [33]:
try:
    ModelValidator.validate_patient(Patient())  # missing patient identifier
except NeofoxDataValidationException as e:
    print("Error message: {}".format(e))

Error message: Patient identifier is empty


In [34]:
patient_without_mhc2 = ModelValidator.validate_patient(Patient(identifier="12345", mhc1=mhc1))

In [35]:
patient_without_mhc1 = ModelValidator.validate_patient(Patient(identifier="12345", mhc2=mhc2))

## Run NeoFox

### Parse input data from a file

Although we could create the data objects manually as shown above, for convenience it is useful to store the data in tabular format. Here we show how to parse the neoantigens and patients from tabular files. 

The tabular file for neoantigens should look as follows:

In [36]:
pd.read_csv("test_model_file.txt", sep="\t")

Unnamed: 0,gene,transcript_identifier,mutation.mutatedXmer,mutation.wildTypeXmer,patientIdentifier
0,VCAN,uc003kii.3,DEVLGEPSQDILVTDQTRLEATISPET,DEVLGEPSQDILVIDQTRLEATISPET,Ptx 27
1,DCST2,uc001fgm.3,RTNLLAALHRSVRWRAADQGHRSAFLV,RTNLLAALHRSVRRRAADQGHRSAFLV,Ptx 24
2,NRAS,uc009wgu.3,MTEYKLVVVGACGVGKSALTIQLIQ,MTEYKLVVVGAGGVGKSALTIQLIQ,Ptx 28
3,CEP350,uc001gnt.3,QTDSSSSDMQACSKDKAKISLGSSIDS,QTDSSSSDMQACSQDKAKISLGSSIDS,Ptx 63
4,CPPED1,uc002dca.4,DRAIPLVLVSGNHYIGNTPTAETVEEF,DRAIPLVLVSGNHDIGNTPTAETVEEF,Ptx 77
5,CXorf26,uc004ecl.1,YNKAVYISVQDKEEEKGVNNGGEKRAD,YNKAVYISVQDKEGEKGVNNGGEKRAD,Ptx 117
6,IGSF9B,uc001qgx.4,ASTHLTVIGTSPHVPGSVRVQVSMTTA,ASTHLTVIGTSPHAPGSVRVQVSMTTA,Ptx 110
7,HEATR5A,uc001wrf.4,TRRDEKSHPFTNPQWATRVFAAECVCR,TRRDEKSHPFTNPRWATRVFAAECVCR,Ptx 26
8,CHRDL2,uc001ovh.3,ARPDMFCLFHGKRHFPGESWHPYLEPQ,ARPDMFCLFHGKRYFPGESWHPYLEPQ,Ptx 77


There is a specific function to parse an input file into a list of neoantigens. Any additional column that does not match a field in the neoantigen model, in this case `transcript_identifier`, is parsed into the external annotations. When executed from the command line interface, NeoFox adds these external annotations to the output together with the new annotations.

In [37]:
neoantigens, external_annotations = ModelConverter.parse_neoantigens_file("test_model_file.txt")
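
The split between model fields and external annotations can be pictured with the standard-library sketch below, which reads the first row of the table above from an in-memory string. The `MODEL_FIELDS` set is a hypothetical subset chosen for this example; the real list of model fields is defined by the NeoFox data models.

```python
import csv
import io

# hypothetical subset of the neoantigen model columns used in this example
MODEL_FIELDS = {"gene", "mutation.mutatedXmer", "mutation.wildTypeXmer", "patientIdentifier"}

table = (
    "gene\ttranscript_identifier\tmutation.mutatedXmer\tmutation.wildTypeXmer\tpatientIdentifier\n"
    "VCAN\tuc003kii.3\tDEVLGEPSQDILVTDQTRLEATISPET\tDEVLGEPSQDILVIDQTRLEATISPET\tPtx 27\n"
)

model_rows, external_rows = [], []
for row in csv.DictReader(io.StringIO(table), delimiter="\t"):
    # known columns feed the model, everything else is kept as an external annotation
    model_rows.append({key: value for key, value in row.items() if key in MODEL_FIELDS})
    external_rows.append({key: value for key, value in row.items() if key not in MODEL_FIELDS})

print(external_rows)
```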

The tabular file for patients should look as follows:

In [38]:
pd.read_csv("test_patient_file.txt", sep="\t")

Unnamed: 0,identifier,mhcIAlleles,mhcIIAlleles,isRnaAvailable,tumorType
0,Ptx 27,"HLA-A*03:01,HLA-A*29:02,HLA-B*07:02,HLA-B*44:0...","HLA-DRB1*04:02,HLA-DRB1*08:01,HLA-DQA1*03:01,H...",True,HNSC
1,Ptx 24,"HLA-A*03:01,HLA-A*29:02,HLA-B*07:02,HLA-B*44:0...","HLA-DRB1*04:02,HLA-DRB1*08:01,HLA-DQA1*03:01,H...",True,HNSC
2,Ptx 28,"HLA-A*03:01,HLA-A*29:02,HLA-B*07:02,HLA-B*44:0...","HLA-DRB1*04:02,HLA-DRB1*08:01,HLA-DQA1*03:01,H...",True,HNSC
3,Ptx 63,"HLA-A*03:01,HLA-A*29:02,HLA-B*07:02,HLA-B*44:0...","HLA-DRB1*04:02,HLA-DRB1*08:01,HLA-DQA1*03:01,H...",True,HNSC
4,Ptx 77,"HLA-A*03:01,HLA-A*29:02,HLA-B*07:02,HLA-B*44:0...","HLA-DRB1*04:02,HLA-DRB1*08:01,HLA-DQA1*03:01,H...",True,HNSC
5,Ptx 117,"HLA-A*03:01,HLA-A*29:02,HLA-B*07:02,HLA-B*44:0...","HLA-DRB1*04:02,HLA-DRB1*08:01,HLA-DQA1*03:01,H...",True,HNSC
6,Ptx 110,"HLA-A*03:01,HLA-A*29:02,HLA-B*07:02,HLA-B*44:0...","HLA-DRB1*04:02,HLA-DRB1*08:01,HLA-DQA1*03:01,H...",True,HNSC
7,Ptx 26,"HLA-A*03:01,HLA-A*29:02,HLA-B*07:02,HLA-B*44:0...","HLA-DRB1*04:02,HLA-DRB1*08:01,HLA-DQA1*03:01,H...",True,HNSC


Parse the patients into the model objects as follows:

In [39]:
patients = ModelConverter.parse_patients_file("test_patient_file.txt")

### Annotate your neoantigens

Running NeoFox requires configuration through a number of environment variables; this is described in detail elsewhere in the documentation. This configuration can also be provided through a file passed to the `NeoFox` class in the field `configuration_file`.
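
Before launching a run it can be convenient to check that the configuration is complete. The sketch below lists the environment variables used in this document and reports the ones that are unset. `missing_variables` is a hypothetical helper, and treating all seven variables as mandatory is an assumption; refer to the configuration documentation for the authoritative list.

```python
import os

# the environment variables set in this document
EXPECTED_VARIABLES = [
    "NEOFOX_REFERENCE_FOLDER",
    "NEOFOX_RSCRIPT",
    "NEOFOX_BLASTP",
    "NEOFOX_NETMHCPAN",
    "NEOFOX_NETMHC2PAN",
    "NEOFOX_MIXMHCPRED",
    "NEOFOX_MIXMHC2PRED",
]


def missing_variables(environment=os.environ):
    """Return the expected variables that are unset or empty."""
    return [name for name in EXPECTED_VARIABLES if not environment.get(name)]


missing = missing_variables()
if missing:
    print("Unset variables: {}".format(", ".join(missing)))
```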

In [40]:
from neofox.neofox import NeoFox
import os

In [41]:
os.environ["NEOFOX_REFERENCE_FOLDER"] = "/home/priesgo/neofox_install/reference_data_9/"
os.environ["NEOFOX_RSCRIPT"] = "/usr/bin/Rscript"
os.environ["NEOFOX_BLASTP"] = "/neofox_install/ncbi-blast-2.10.1+/bin/blastp"
os.environ["NEOFOX_NETMHCPAN"] = "/neofox_install/netMHCpan-4.0/netMHCpan"
os.environ["NEOFOX_NETMHC2PAN"] = "/neofox_install/netMHCIIpan-3.2/netMHCIIpan"
os.environ["NEOFOX_MIXMHCPRED"] = "/neofox_install/MixMHCpred-2.1/MixMHCpred"
os.environ["NEOFOX_MIXMHC2PRED"] = "/neofox_install/MixMHC2pred-1.2/MixMHC2pred_unix"
annotations = NeoFox(neoantigens=neoantigens, patients=patients, num_cpus=4).get_annotations()

[I 201211 07:53:20 neofox:139] Loading data...
[I 201211 07:53:20 references:162] Reference genome folder: /home/priesgo/neofox_install/reference_data_9/
[I 201211 07:53:20 references:163] Resources
[I 201211 07:53:20 references:165] /home/priesgo/neofox_install/reference_data_9/netmhc2pan_available_alleles.txt
[I 201211 07:53:20 references:165] /home/priesgo/neofox_install/reference_data_9/netmhcpan_available_alleles.txt
[I 201211 07:53:20 references:165] /home/priesgo/neofox_install/reference_data_9/iedb
[I 201211 07:53:20 references:165] /home/priesgo/neofox_install/reference_data_9/proteome_db
[I 201211 07:53:20 references:165] /home/priesgo/neofox_install/reference_data_9/proteome_db/Homo_sapiens.fa
[I 201211 07:53:20 references:165] /home/priesgo/neofox_install/reference_data_9/iedb/IEDB.fasta
[I 201211 07:53:20 references:165] /home/priesgo/neofox_install/reference_data_9/proteome_db/Homo_sapiens.fa
[I 201211 07:53:20 neofox:110] Data loaded
[I 201211 07:53:20 neofox:181] Starti

NeoFox returns a list of annotations for each neoantigen. These are stored in an object called `NeoantigenAnnotations`, which contains the corresponding neoantigen identifier, the annotator (i.e. neofox), the annotator version, a timestamp and finally a list of the annotations.

In [42]:
annotations[0].to_dict()

{'neoantigenIdentifier': 'xfXqjXFHfWOR8wSnvrAXRQ==',
 'annotations': [{'name': 'Best_rank_MHCI_score', 'value': '6.2429'},
  {'name': 'Best_rank_MHCI_score_epitope', 'value': 'GEPSQDILVT'},
  {'name': 'Best_rank_MHCI_score_allele', 'value': 'HLA-B*44:03'},
  {'name': 'Best_affinity_MHCI_score', 'value': '3984.4'},
  {'name': 'Best_affinity_MHCI_epitope', 'value': 'ILVTDQTRL'},
  {'name': 'Best_affinity_MHCI_allele', 'value': 'HLA-C*16:01'},
  {'name': 'Best_rank_MHCI_9mer_score', 'value': '6.2525'},
  {'name': 'Best_rank_MHCI_9mer_epitope', 'value': 'ILVTDQTRL'},
  {'name': 'Best_rank_MHCI_9mer_allele', 'value': 'HLA-C*16:01'},
  {'name': 'Best_affinity_MHCI_9mer_score', 'value': '3984.4'},
  {'name': 'Best_affinity_MHCI_9mer_allele', 'value': 'HLA-C*16:01'},
  {'name': 'Best_affinity_MHCI_9mer_epitope', 'value': 'ILVTDQTRL'},
  {'name': 'Best_affinity_MHCI_score_WT', 'value': '4474'},
  {'name': 'Best_affinity_MHCI_epitope_WT', 'value': 'ILVIDQTRL'},
  {'name': 'Best_affinity_MHCI_all

### Transform the annotations into a data frame

In [43]:
annotations_ts = ModelConverter.annotations2tall_skinny_table(neoantigen_annotations=annotations)
annotations_ts.head(10)

Unnamed: 0,name,neoantigen_identifier,value
0,Best_rank_MHCI_score,xfXqjXFHfWOR8wSnvrAXRQ==,6.2429
1,Best_rank_MHCI_score_epitope,xfXqjXFHfWOR8wSnvrAXRQ==,GEPSQDILVT
2,Best_rank_MHCI_score_allele,xfXqjXFHfWOR8wSnvrAXRQ==,HLA-B*44:03
3,Best_affinity_MHCI_score,xfXqjXFHfWOR8wSnvrAXRQ==,3984.4
4,Best_affinity_MHCI_epitope,xfXqjXFHfWOR8wSnvrAXRQ==,ILVTDQTRL
5,Best_affinity_MHCI_allele,xfXqjXFHfWOR8wSnvrAXRQ==,HLA-C*16:01
6,Best_rank_MHCI_9mer_score,xfXqjXFHfWOR8wSnvrAXRQ==,6.2525
7,Best_rank_MHCI_9mer_epitope,xfXqjXFHfWOR8wSnvrAXRQ==,ILVTDQTRL
8,Best_rank_MHCI_9mer_allele,xfXqjXFHfWOR8wSnvrAXRQ==,HLA-C*16:01
9,Best_affinity_MHCI_9mer_score,xfXqjXFHfWOR8wSnvrAXRQ==,3984.4


In [44]:
annotations_sw = ModelConverter.annotations2short_wide_table(annotations, neoantigens)
annotations_sw

Unnamed: 0,identifier,dnaVariantAlleleFrequency,gene,mutation.mutatedXmer,mutation.position,mutation.wildTypeXmer,patientIdentifier,rnaExpression,rnaVariantAlleleFrequency,ADN_MHCI,...,PHBR-I,PHBR-II,Pathogensimiliarity_MHCI_affinity_9mer,Priority_score,Recognition_Potential_MHCI_affinity_9mer,Selfsimilarity_MHCI_conserved_binder,Tcell_predictor_score_cutoff500nM,mutation_not_found_in_proteome,vaxrank_binding_score,vaxrank_total_score
0,xfXqjXFHfWOR8wSnvrAXRQ==,0.0,VCAN,DEVLGEPSQDILVTDQTRLEATISPET,14,DEVLGEPSQDILVIDQTRLEATISPET,Ptx 27,0.0,0.0,0,...,9.2247,37.098,0,0,,0.98674597520596,,1,0.0,0
1,6CSxIpO9Iomh0GGDtpTfSw==,0.0,DCST2,RTNLLAALHRSVRWRAADQGHRSAFLV,14,RTNLLAALHRSVRRRAADQGHRSAFLV,Ptx 24,0.0,0.0,0,...,3.4198,1.058,0,0,,0.9421875787623096,,1,0.11069,0
2,Ttu8puMmQmrd2XJkcR93gw==,0.0,NRAS,MTEYKLVVVGACGVGKSALTIQLIQ,12,MTEYKLVVVGAGGVGKSALTIQLIQ,Ptx 28,0.0,0.0,0,...,2.4762,35.345,0,0,0.0,0.9330521460001094,0.5068878716790075,1,1.6547,0
3,yz2rwB4gsW0OpopTUOx90g==,0.0,CEP350,QTDSSSSDMQACSKDKAKISLGSSIDS,14,QTDSSSSDMQACSQDKAKISLGSSIDS,Ptx 63,0.0,0.0,0,...,4.7855,57.667,0,0,,,,1,0.14174,0
4,66+h9DAxsGSt6WJIcpNHyA==,0.0,CPPED1,DRAIPLVLVSGNHYIGNTPTAETVEEF,14,DRAIPLVLVSGNHDIGNTPTAETVEEF,Ptx 77,0.0,0.0,1,...,1.4605,7.4182,0,0,0.0,,0.5720939101479084,1,0.89423,0
5,1rfhKuRqfyii+dlMgNSnUg==,0.0,CXorf26,YNKAVYISVQDKEEEKGVNNGGEKRAD,14,YNKAVYISVQDKEGEKGVNNGGEKRAD,Ptx 117,0.0,0.0,0,...,15.021,29.186,0,0,,0.954217557594994,,1,0.0,0
6,ktvVuVby7Yj7kt2e7D2xtg==,0.0,IGSF9B,ASTHLTVIGTSPHVPGSVRVQVSMTTA,14,ASTHLTVIGTSPHAPGSVRVQVSMTTA,Ptx 110,0.0,0.0,0,...,0.93399,24.362,0,0,0.0,0.9649973487719512,0.0991994065427799,1,3.7569,0
7,+Qe3EvwMGD4jqKx9y49zrA==,0.0,HEATR5A,TRRDEKSHPFTNPQWATRVFAAECVCR,14,TRRDEKSHPFTNPRWATRVFAAECVCR,Ptx 26,0.0,0.0,0,...,0.74995,1.262,0,0,0.0,0.9614920042660836,0.3374084622586067,1,2.5678,0
8,2ZnOC3PFamYcHl/V7xCYYg==,0.0,CHRDL2,ARPDMFCLFHGKRHFPGESWHPYLEPQ,14,ARPDMFCLFHGKRYFPGESWHPYLEPQ,Ptx 77,0.0,0.0,0,...,1.6364,31.649,0,0,,0.9782021524274525,,1,0.92031,0
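
The two tables hold the same information in different shapes: the tall-skinny table has one row per annotation, while the short-wide table has one row per neoantigen and one column per annotation. The sketch below shows the reshaping with plain dictionaries, using a few values taken from the outputs above. It only illustrates the relationship between the two layouts and is not how `ModelConverter` builds them.

```python
from collections import defaultdict

# tall-skinny layout: one record per (neoantigen, annotation) pair
tall = [
    {"neoantigen_identifier": "xfXqjXFHfWOR8wSnvrAXRQ==", "name": "PHBR-I", "value": "9.2247"},
    {"neoantigen_identifier": "xfXqjXFHfWOR8wSnvrAXRQ==", "name": "PHBR-II", "value": "37.098"},
    {"neoantigen_identifier": "6CSxIpO9Iomh0GGDtpTfSw==", "name": "PHBR-I", "value": "3.4198"},
    {"neoantigen_identifier": "6CSxIpO9Iomh0GGDtpTfSw==", "name": "PHBR-II", "value": "1.058"},
]

# short-wide layout: one dictionary of annotations per neoantigen
wide = defaultdict(dict)
for record in tall:
    wide[record["neoantigen_identifier"]][record["name"]] = record["value"]

for identifier, annotations in wide.items():
    print(identifier, annotations)
```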
