<h1> Polymorphic Epitope Prediction </h1>

This tutorial illustrates how to use Fred2 to integrate a patient's genomic information into epitope prediction.

This tutorial entails:
- Variant construction
- Polymorphic Transcript/Protein/Peptide construction
- Polymorphic epitope prediction


In [1]:
%matplotlib inline
%reload_ext autoreload
%autoreload 2

from Fred2.Core import Allele, Peptide, Protein, generate_peptides_from_proteins
from Fred2.IO import read_lines, read_fasta
from Fred2.EpitopePrediction import EpitopePredictorFactory
from Fred2.Core import generate_transcripts_from_variants, generate_proteins_from_transcripts 
from Fred2.Core import generate_peptides_from_variants
from Fred2.IO import read_annovar_exonic
from Fred2.IO import MartsAdapter
from Fred2.IO import EIdentifierTypes

<h2> Chapter 1: Generating polymorphic epitopes </h2>
<br/>
We first have to construct variants to work with. We can do this either manually or with one of FRED2's IO functions. Currently, FRED2 supports annotated exonic ANNOVAR files. Once the variant objects are created, we can use them to construct polymorphic transcripts with `Fred2.Core.generate_transcripts_from_variants`. For that, we also have to specify the database from which the additional transcript information (such as the sequence) should be extracted, and which identification system (e.g. RefSeq, Ensembl) was used to annotate the variants. Here we use `Fred2.IO.MartsAdapter` to connect to the remote BioMart DB and select `RefSeq` as the identification system via `Fred2.IO.EIdentifierTypes.REFSEQ`. With the named argument `biomart=`, we can also specify which community BioMart DB should be used instead of the central BioMart server.<br/>

`Fred2.Core.generate_transcripts_from_variants` generates all combinatorial possibilities of heterozygous and homozygous variants and incorporates them into the reference transcript sequence. The function therefore quickly becomes impractical when a large number of heterozygous variants has to be processed: $n$ heterozygous variants generate $2^n$ transcript objects. This exhaustive enumeration is necessary because phasing information for heterozygous variants is not routinely available.


In [3]:
vars = read_annovar_exonic("./data/test_annovar.out")
mart =  MartsAdapter(biomart="http://grch37.ensembl.org/biomart/martservice?query=")
trans = generate_transcripts_from_variants(vars, mart, EIdentifierTypes.REFSEQ)
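
The $2^n$ growth mentioned above can be illustrated without FRED2. The following is a minimal sketch, not FRED2's implementation: the function `enumerate_variant_combinations` and the `(name, is_homozygous)` tuples are hypothetical stand-ins for variant objects. It assumes that homozygous variants are always applied, while each heterozygous variant is independently present or absent.

```python
from itertools import product


def enumerate_variant_combinations(variants):
    """Yield every variant set that could occur on one transcript copy.

    `variants` is a list of (name, is_homozygous) tuples; these are
    hypothetical stand-ins, not FRED2 Variant objects.
    """
    homozygous = [name for name, is_hom in variants if is_hom]
    heterozygous = [name for name, is_hom in variants if not is_hom]
    # Each heterozygous variant is independently present (True) or absent (False).
    for mask in product([False, True], repeat=len(heterozygous)):
        chosen = [name for name, keep in zip(heterozygous, mask) if keep]
        yield homozygous + chosen


# One homozygous and three heterozygous variants (made-up names).
example_variants = [("v1", True), ("v2", False), ("v3", False), ("v4", False)]
combinations = list(enumerate_variant_combinations(example_variants))
print(len(combinations))  # 2**3 = 8 combinations
```

Adding one more heterozygous variant to `example_variants` doubles the number of combinations, which is why the exhaustive approach does not scale to heavily mutated transcripts.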

Once we have generated the polymorphic transcripts, we can use `Fred2.Core.generate_proteins_from_transcripts` to construct protein sequences. The resulting protein sequences contain the non-synonymous variants that affected them.

In [4]:
proteins = generate_proteins_from_transcripts(trans)
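
To make the notion of a non-synonymous variant concrete, here is a small self-contained sketch that translates a toy coding sequence with the standard genetic code and compares the protein before and after a single-nucleotide substitution. The helpers `translate` and `apply_snv` and the toy sequence are illustrative assumptions and do not reflect how FRED2 is implemented.

```python
# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = dict(
    zip((a + b + c for a in BASES for b in BASES for c in BASES), AMINO_ACIDS)
)


def translate(cds):
    """Translate a coding sequence codon by codon; '*' marks a stop codon."""
    usable = len(cds) - len(cds) % 3  # ignore an incomplete trailing codon
    return "".join(CODON_TABLE[cds[i:i + 3]] for i in range(0, usable, 3))


def apply_snv(cds, index, alt):
    """Return a copy of `cds` with the base at 0-based `index` replaced by `alt`."""
    return cds[:index] + alt + cds[index + 1:]


reference = "ATGGCTGAA"                        # toy sequence: Met-Ala-Glu
synonymous = apply_snv(reference, 5, "C")      # GCT -> GCC, still Ala
non_synonymous = apply_snv(reference, 4, "T")  # GCT -> GTT, Ala -> Val

print(translate(reference), translate(synonymous), translate(non_synonymous))
```

Only the second substitution changes the translated sequence, so only it would be visible at the protein level. The `*` symbol marks a stop codon, which is why the prediction step later in this tutorial filters out peptides containing it.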

By using `Fred2.Core.generate_peptides_from_proteins`, we can now generate polymorphic peptide objects from the previously generated polymorphic proteins. In addition to the proteins from which the peptides are generated, we have to specify the desired peptide length (e.g. 9-mers).

In [None]:
peptides1 = list(generate_peptides_from_proteins(proteins, 9))
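
Conceptually, generating peptides of a fixed length amounts to sliding a window over each protein sequence. The sketch below shows this on a made-up sequence; `generate_kmers` is a hypothetical helper that works on plain strings, whereas the FRED2 function operates on protein objects and also tracks origin and variant information.

```python
def generate_kmers(sequence, k):
    """Yield (start, peptide) for every contiguous window of length k."""
    for start in range(len(sequence) - k + 1):
        yield start, sequence[start:start + k]


toy_protein = "MKTAYIAKQRQISFVK"  # made-up 16-residue sequence
nine_mers = list(generate_kmers(toy_protein, 9))

print(len(nine_mers))  # 16 - 9 + 1 = 8 windows
print(nine_mers[0])    # (0, 'MKTAYIAKQ')
```

A sequence of length $L$ yields $L - k + 1$ peptides of length $k$, and none if the sequence is shorter than $k$.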

If we are only interested in polymorphic peptides, or if the high number of heterozygous variants prohibits the construction of all polymorphic transcripts/proteins, we can use `Fred2.Core.generate_peptides_from_variants`. This function restricts the combinatorial exploration to a specific window size, thereby reducing the number of possible combinations to $2^m$ with $m \ll n$. The window size corresponds to the length of the desired peptides (e.g. 9-mers).

In [None]:
peptides2 = generate_peptides_from_variants(vars, 9, mart, EIdentifierTypes.REFSEQ)
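
The benefit of the window restriction can be estimated with a short calculation. The sketch below is a simplification under stated assumptions, not FRED2's algorithm: variants are treated as substitutions at fixed 0-based positions (indels, which shift coordinates, are ignored), and both the positions and the helper `window_combination_counts` are made up for illustration.

```python
def window_combination_counts(variant_positions, sequence_length, k):
    """For each window of length k, return 2**m where m is the number of
    heterozygous variants falling inside that window."""
    counts = []
    for start in range(sequence_length - k + 1):
        m = sum(1 for pos in variant_positions if start <= pos < start + k)
        counts.append(2 ** m)
    return counts


# Ten made-up heterozygous variant positions on a sequence of length 500.
het_positions = [3, 5, 40, 41, 90, 150, 151, 152, 300, 410]
counts = window_combination_counts(het_positions, 500, 9)

print(max(counts))              # largest per-window enumeration: 2**3 = 8
print(2 ** len(het_positions))  # whole-sequence enumeration: 2**10 = 1024
```

In this toy setting, no window contains more than three variants, so at most 8 combinations have to be explored per window instead of 1024 for the full sequence.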

**Note**: All functions starting with `generate` are true generator functions. They defer the computation until the objects are actually needed, but can be traversed only once.
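
The single-pass behaviour of generators is plain Python and can be demonstrated with a toy generator (the function `squares` is just an illustration and unrelated to FRED2):

```python
def squares(n):
    """Toy generator: values are computed lazily, one at a time."""
    for i in range(n):
        yield i * i


gen = squares(4)
first_pass = list(gen)   # consumes the generator: [0, 1, 4, 9]
second_pass = list(gen)  # the generator is exhausted: []

print(first_pass, second_pass)
```

This is why `peptides1` is wrapped in `list(...)` in this tutorial: materializing the result once allows it to be reused for prediction and post-processing, at the cost of holding all objects in memory.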

<h2> Chapter 2: Epitope prediction </h2>
<br/>
Once we have generated the peptide objects, we can, for example, predict their binding affinity. For that, we first have to initialize HLA allele objects and an epitope prediction method. For more information, see the <a href="https://github.com/FRED-2/Fred2/blob/master/Fred2/tutorials/EpitopePrediction.ipynb">tutorial on epitope prediction</a>.


In [8]:
alleles = read_lines("./data/alleles.txt", in_type=Allele)
svmhc = EpitopePredictorFactory("svmhc")
pred_df = svmhc.predict(filter(lambda x: "*" not in str(x), peptides1), alleles=alleles)
pred_df.head()

| Seq | Method | A*02:01 | B*15:01 |
|---|---|---|---|
| (A, A, A, Q, E, A, Q, A, D) | svmhc | -0.995096 | -1.295513 |
| (A, A, D, F, P, R, W, K, R) | svmhc | -1.068312 | -0.813017 |
| (A, A, F, L, A, G, L, L, S) | svmhc | -1.558208 | -0.920914 |
| (A, A, F, L, L, Q, H, V, Q) | svmhc | -1.317206 | -1.189752 |
| (A, A, F, M, Y, V, F, Y, V) | svmhc | 0.607884 | -0.709952 |


<h2> Chapter 3: Post-processing </h2>
<br/>
Polymorphic peptides provide functionality that lets the user identify the variants that influenced the peptide sequence and locate their positions within it. `Peptide.get_variants_by_protein()` returns a list of the variants that influenced the peptide sequence originating from a specific protein. `Peptide.get_variants_by_protein_position()` returns a dictionary whose keys are the relative positions of the variants within the peptide sequence that originated from a specific protein and protein position, and whose values are the variants at those positions.

In [9]:
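# Keep only peptides that carry at least one variant in any of their source proteins
poly_peps = filter(lambda x: any(x.get_variants_by_protein(prot.transcript_id) for prot in x.get_all_proteins()), peptides1)
c = 0
for p in poly_peps:
    c += 1
    if c >= 3:  # only report the first two polymorphic peptides
        break
    for prot, poss in p.proteinPos.iteritems():
        print
        print prot, p.get_variants_by_protein(prot)
        for pos in poss:
            # Variants at this protein position, keyed by their offset within the peptide.
            # A separate name avoids overwriting the `vars` list loaded earlier.
            pos_vars = p.get_variants_by_protein_position(prot, pos)
            if pos_vars:
                print pos_vars, " Protein position: ", pos, " Peptide: ", p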


NM_144701:FRED2_1 [Variant(g.67705958G>A)]
{1: [Variant(g.67705958G>A)]}  Protein position:  379  Peptide:  FQTGIKRRI

NM_022162:FRED2_3 [Variant(g.50756540G>C)]
{8: [Variant(g.50756540G>C)]}  Protein position:  899  Peptide:  SLQFLGFWR

NM_022162:FRED2_1 [Variant(g.50756540G>C)]
{8: [Variant(g.50756540G>C)]}  Protein position:  899  Peptide:  SLQFLGFWR
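
The relation between a variant's protein position and its relative position within a peptide can be sketched as a simple offset calculation. This is an illustration under assumptions, not FRED2's code: the helper `variant_offsets_in_peptide`, the variant names, and the coordinates are all made up, chosen only to resemble the shape of the output above, and all positions are assumed to share the same zero-based convention.

```python
def variant_offsets_in_peptide(peptide_start, peptide_length, variant_positions):
    """Map offsets within a peptide to the variants located there.

    `variant_positions` maps a (hypothetical) variant name to its position
    in the protein; `peptide_start` is the protein position of the peptide's
    first residue.
    """
    offsets = {}
    for name, protein_pos in variant_positions.items():
        offset = protein_pos - peptide_start
        if 0 <= offset < peptide_length:  # variant lies inside the peptide
            offsets.setdefault(offset, []).append(name)
    return offsets


# Made-up coordinates: a 9-mer starting at protein position 379.
offsets = variant_offsets_in_peptide(379, 9, {"variant_a": 380, "variant_b": 500})
print(offsets)  # {1: ['variant_a']}
```

Variants that fall outside the peptide window, such as `variant_b` here, do not appear in the result.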
