<h1> Using the Fred2 generators </h1>

This tutorial illustrates how to benefit from `Fred2` generators in your immunoinformatic scripts.


It is assumed that you are familiar with basic Python and have acquainted yourself with `Fred2` through the other `Fred2` IPython notebooks.

Let's start by importing the necessary modules:

In [21]:
import sys
sys.path.extend(['/home/walzer/immuno-tools/Fred2'])

from Fred2.Core import Variant, Transcript, Peptide, Protein
from Fred2.Core import generate_transcripts_from_variants
from Fred2.Core import generate_proteins_from_transcripts
from Fred2.Core import generate_peptides_from_proteins
from Fred2.Core import generate_peptides_from_variants
from Fred2.IO import read_annovar_exonic
from Fred2.IO.MartsAdapter import MartsAdapter
from Fred2.IO.ADBAdapter import EIdentifierTypes

<h2> Chapter 1: Generators in python </h2>
<br/>
We will first revisit the concept of generators in Python before laying out the benefits of generators in immunoinformatic applications.
Python generators are iterators, i.e. they can be used in a for loop. But in contrast to a list, the elements to be iterated over do not have to reside in memory all at once.

As an exaggerated illustration, imagine a class of objects very rich in unique information, each therefore taking up a really big chunk of memory, say 1 MB. If you have a lot of them and want to access them in series, you can put them in a list and iterate over them. But putting them in a list implies that they all have to exist at once, so the list already takes up 1 GB of memory if you have only 1000 objects.

But if you only have a temporary interest in most of these objects *and* you can create them dynamically, this is an ideal case to save some memory with Python generators. They will generate the objects on-the-go (and you can of course retain the interesting ones).
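
A minimal sketch of this idea, assuming a hypothetical generator function `make_records` whose dictionaries stand in for the large objects (names and sizes are purely illustrative):

```python
def make_records(n, size=1000):
    """Yield n hypothetical 'large' records, one at a time."""
    for i in range(n):
        # Each record is created only when the consumer asks for it.
        yield {"id": i, "payload": [i] * size}

# Retain only the records we are interested in (here: every 250th one);
# all other records can be garbage-collected right after they are inspected.
interesting = [rec["id"] for rec in make_records(1000) if rec["id"] % 250 == 0]
print(interesting)
```

Only one record needs to exist at any time during the loop, while the retained IDs are kept in an ordinary list.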

What makes generators so powerful and flexible is that the code for iterating over a generator is the same as for iterating over a list.

In [7]:
def print_odd_numbers(numbers):
    for i in numbers:
        if i%2 != 0:
            print i
            
nums1 = [1,2,3,4,5,6,7,8,9,10] 
nums2 = xrange(11)
print "list:"
print_odd_numbers(nums1)
print "generator:"
print_odd_numbers(nums2)

list:
1
3
5
7
9
generator:
1
3
5
7
9


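As you can see, the function **`print_odd_numbers`** works with both *lists* and lazily evaluated sequences (<a href="https://docs.python.org/2/library/functions.html#xrange">xrange</a> produces the integers in the given range one at a time instead of building the whole list, which is the behaviour we want to illustrate here; strictly speaking, it is a lazy sequence type rather than a generator).

Note that with a true generator (one created by a generator function or a generator expression), the current object is discarded after you iterate to the next, unless you keep a reference to it. This way, you do not waste any memory. **But** this also means that you cannot 'reuse' a generator once it is exhausted; you have to 're-initialize' it first. Also, you cannot random-access its objects (e.g. **`my_list[123]`**). **`xrange`** is more lenient in both respects, as **`nums2`** can be iterated over repeatedly and indexed.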
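
The single-pass behaviour can be seen in a small sketch that uses a hypothetical generator function `odd_numbers` (an illustration only, not part of `Fred2`):

```python
def odd_numbers(limit):
    """Generator function yielding the odd numbers below limit."""
    for i in range(limit):
        if i % 2 != 0:
            yield i

gen = odd_numbers(11)
first_pass = list(gen)    # consumes the generator
second_pass = list(gen)   # nothing left: the generator is exhausted
print(first_pass)
print(second_pass)

gen = odd_numbers(11)     # 're-initialize' by calling the function again
try:
    gen[2]                # generators do not support indexing
except TypeError as err:
    print("no random access: %s" % err)
```

The second pass yields nothing, and the indexing attempt raises a `TypeError`.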

<h2> Chapter 2: Epitope prediction </h2>
<br/>
In `Fred2` we are dealing with a lot of sequences of different kinds. We take transcript sequences, integrate mutational variants and account for heterozygosity. We generate individualized protein sequences, from which we slice peptides and calculate their immunological properties. And we are only interested in a specific set of them, say the predicted binders to a certain MHC molecule.
This is an ideal use case for Python generators, as we
  * can generate our objects on-the-go,
  * are only interested in a few and 
  * have to deal with a whole lot of them and want to preserve memory.

So, for a concrete example, if we are only interested in polymorphic peptides, the high number of heterozygous variants prohibits the construction of all polymorphic transcripts at once. To make matters worse, we do not want to keep track of all resulting proteins, only of those producing immunologically interesting peptides.
The combinatorial explosion prohibits such brute-force approaches even on medium-sized sets of sequences.
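
To get a feeling for the numbers, here is a rough sketch under the simplifying assumption that each heterozygous variant is independently either present or absent in a transcript, which gives 2^n combinations for n variants. The function `variant_combinations` and the variant labels are hypothetical and not part of `Fred2`:

```python
from itertools import product

def variant_combinations(variants):
    """Lazily yield every present/absent combination of the given variants."""
    for mask in product([False, True], repeat=len(variants)):
        yield [v for v, used in zip(variants, mask) if used]

variants = ["v%d" % i for i in range(20)]   # hypothetical variant labels
print(2 ** len(variants))                   # 1048576 combinations in total

# The generator never builds all combinations; we can just peek at a few.
combos = variant_combinations(variants)
first_three = [next(combos) for _ in range(3)]
print(first_three)
```

With only 20 such variants there are already more than a million combinations, yet the generator holds just one of them in memory at a time.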

In `Fred2`, we have therefore prepared a Python generator solution for the `Fred2` objects of <a href="http://fred2.readthedocs.org/en/latest/Fred2.Core.html#module-Fred2.Core.Variant">variants</a>, <a href="http://fred2.readthedocs.org/en/latest/Fred2.Core.html#module-Fred2.Core.Transcript">transcripts</a>, <a href="http://fred2.readthedocs.org/en/latest/Fred2.Core.html#module-Fred2.Core.Protein">proteins</a> and <a href="http://fred2.readthedocs.org/en/latest/Fred2.Core.html#module-Fred2.Core.Peptide">peptides</a> in the `Fred2.Core` module. The remainder of this chapter introduces the usage of the basic `Fred2` generators.

We will start with the **`generate_transcripts_from_variants`** function. To show how the generator works on a small number of variants, we will use an excerpt of an ANNOVAR output file.

In [34]:
vars = read_annovar_exonic("data/annovar_excerpt.out")
print vars

[Variant(g.67705958G>A), Variant(g.234183368A>G), Variant(g.20763686G>-)]


The generator takes a list of variants, each of which must have some form of sequence identifier in its **`coding`** field, designating the sequence on which it denotes a variation. It also takes a <a href="http://fred2.readthedocs.org/en/latest/Fred2.Core.html#module-Fred2.IO.DBAdapter">DBAdapter</a> to retrieve the sequences for these identifiers and an **`EIdentifierTypes`** value to indicate the type of identifier to the **`DBAdapter`**.

In [38]:
print vars[0].coding  # show a variant's coding
mart = MartsAdapter(biomart="http://www.ensembl.org")
trans = generate_transcripts_from_variants(vars, mart, EIdentifierTypes.REFSEQ)

{'NM_144701': <Fred2.Core.Variant.MutationSyntax instance at 0x7f15c93b0200>}


Now we can simply iterate over our transcripts as they are created on-the-go.
       

In [39]:
for t in trans:
    print len(t)

681
680
1824
1824
1890
1890
1824
1824
1335
1335
1824
1824
1767
1767


In [None]:
# Chain the generators: transcripts are translated into proteins on-the-go,
# and only the unique protein sequences (as strings) are retained.
prot = set(map(str,
               generate_proteins_from_transcripts(
                   generate_transcripts_from_variants(vars, mart, EIdentifierTypes.REFSEQ))))


In [None]:
# Slice peptides of length 3 (second argument) from the generated proteins
# and keep only the unique peptide sequences.
peps_from_prot = set(map(str,
                         generate_peptides_from_proteins(
                             generate_proteins_from_transcripts(
                                 generate_transcripts_from_variants(vars, mart, EIdentifierTypes.REFSEQ)),
                             3)))
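
How such a chain of generators works can be sketched without `Fred2`. In the following illustration, plain strings stand in for proteins, and the functions `proteins_from` and `peptides_from` are hypothetical stand-ins, not the `Fred2` API. Each stage pulls one item at a time from the stage before it, so no intermediate list is ever built:

```python
def proteins_from(sequences):
    """Stand-in for a protein generator: pass the sequences on one by one."""
    for seq in sequences:
        yield seq

def peptides_from(proteins, window_size):
    """Yield every substring of length window_size of each protein."""
    for prot in proteins:
        for start in range(len(prot) - window_size + 1):
            yield prot[start:start + window_size]

toy_proteins = ["MKVLA", "KVLAT"]   # hypothetical toy sequences
# Only the final set of unique peptides is kept in memory.
peps = set(peptides_from(proteins_from(toy_proteins), 3))
print(sorted(peps))
```

Collecting the results in a `set` removes peptides that occur in more than one protein, which is the same reason the cells above wrap the generator chain in `set(map(str, ...))`.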

<h2> Chapter 3: Post-processing </h2>
<br/>
These polymorphic peptides offer functionality that allows the user to identify the variants that influenced the peptide sequence and to locate their positions within it. With `Peptide.get_variants_by_protein()` we obtain a list of the variants that influenced the peptide sequence originating from a specific protein. `Peptide.get_variants_by_protein_position()` returns a dictionary whose keys are the relative positions of the variants within the peptide sequence that originated from a specific protein and protein position.

In [9]:
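# Keep only polymorphic peptides, i.e. peptides influenced by at least one variant.
# Note: peptides1 (a collection of Peptide objects) is not defined in this notebook.
poly_peps = filter(lambda x: any(x.get_variants_by_protein(prot.transcript_id) for prot in x.get_all_proteins()), peptides1)
c = 0
for p in poly_peps:
    c += 1
    if c >= 3:  # only report the first two peptides
        break
    for prot, poss in p.proteinPos.iteritems():
        print
        print prot, p.get_variants_by_protein(prot)
        for pos in poss:
            # dictionary: position within the peptide -> variants at that position
            pos_vars = p.get_variants_by_protein_position(prot, pos)
            if pos_vars:
                print pos_vars, " Protein position: ", pos, " Peptide: ", p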


NM_144701:FRED2_1 [Variant(g.67705958G>A)]
{1: [Variant(g.67705958G>A)]}  Protein position:  379  Peptide:  FQTGIKRRI

NM_022162:FRED2_3 [Variant(g.50756540G>C)]
{8: [Variant(g.50756540G>C)]}  Protein position:  899  Peptide:  SLQFLGFWR

NM_022162:FRED2_1 [Variant(g.50756540G>C)]
{8: [Variant(g.50756540G>C)]}  Protein position:  899  Peptide:  SLQFLGFWR
