<h1> Using the Fred2 generators </h1>

This tutorial illustrates how to benefit from `Fred2` generators in your immunoinformatic scripts.

It is assumed that you know basic Python and have familiarized yourself with `Fred2` through the other `Fred2` IPython notebooks.

Let's start by importing the necessary modules:

In [1]:
from Fred2.Core import Variant, Transcript, Peptide, Protein
from Fred2.Core import generate_transcripts_from_variants
from Fred2.Core import generate_proteins_from_transcripts
from Fred2.Core import generate_peptides_from_proteins
from Fred2.Core import generate_peptides_from_variants
from Fred2.IO import read_annovar_exonic
from Fred2.IO.MartsAdapter import MartsAdapter
from Fred2.IO.ADBAdapter import EIdentifierTypes

<h2> Chapter 1: Generators in python </h2>
<br/>
We will first revisit the concept of generators in Python before laying out the benefits of generators in immunoinformatic applications.
Python generators can be used wherever an iterable is expected, e.g. in a for loop. But contrary to lists, the elements to be iterated over do not have to reside in memory all at once.

As an exaggerated illustration, imagine a class of objects very rich in unique information, each therefore taking up a big chunk of memory, say 1 MB. If you have a lot of them and want to access them in series, you can put them in a list and iterate over it. But putting them in a list implies that they all have to exist at once, so the list already takes up about 1 GB of memory for only 1000 objects.

But if you only have a temporary interest in most of these objects *and* you can create them dynamically, this is an ideal case to save some memory with python generators. They will generate the objects on-the-go (and you can of course retain interesting ones).
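
As a minimal sketch of this idea (the function `generate_records` and its dictionary "records" are hypothetical stand-ins, not part of `Fred2`; the code is written so that it should run under both Python 2 and Python 3), a generator function can create each object only when it is requested, while the caller retains just the interesting ones:

```python
def generate_records(n):
    """Yield n hypothetical records one at a time instead of building a list."""
    for i in range(n):
        # Each record is created only when the consumer asks for the next item.
        yield {"id": i, "payload": "x" * 1000}

# Retain only the records we care about; every other record can be
# garbage-collected as soon as the loop moves on to the next one.
interesting = [rec["id"] for rec in generate_records(1000) if rec["id"] % 250 == 0]
print(interesting)
```

Only the four retained ids stay in memory after the loop; the 1000 payloads never coexist.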

What makes generators so powerful and flexible is that the code for iterating over a generator is the same as for iterating over a list.

In [2]:
def print_odd_numbers(numbers):
    for i in numbers:
        if i%2 != 0:
            print i
            
nums1 = [1,2,3,4,5,6,7,8,9,10] 
nums2 = xrange(11)
print "list:"
print_odd_numbers(nums1)
print "generator:"
print_odd_numbers(nums2)

list:
1
3
5
7
9
generator:
1
3
5
7
9


As you can see, the function **`print_odd_numbers`** works with both *lists* and lazily evaluated iterables (<a href="https://docs.python.org/2/library/functions.html#xrange">xrange</a> produces the integers in the given range on demand instead of storing them in a list; strictly speaking it is a lazy sequence type rather than a generator, but it serves the same purpose here).

Note that a generator discards the current object once you iterate to the next, so no memory is wasted. **But** this also means that you cannot 'reuse' a generator once it is exhausted; you have to 'reinitialize' it first. You also cannot random-access its objects (e.g. **`my_list[123]`**). (**`nums2`** is an exception on both counts, because `xrange` objects can be iterated repeatedly and indexed.)
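
The two restrictions can be seen in a small sketch with a hand-written generator function (`count_up_to` is a hypothetical helper introduced only for this illustration; the code should run under both Python 2 and Python 3):

```python
def count_up_to(n):
    """A generator function yielding the integers 1..n."""
    i = 1
    while i <= n:
        yield i
        i += 1

gen = count_up_to(3)
first_pass = list(gen)     # consumes the generator completely
second_pass = list(gen)    # the generator is exhausted, nothing is left
print(first_pass)
print(second_pass)

gen = count_up_to(3)       # 'reinitialize' by calling the function again
try:
    gen[1]                 # generators do not support indexing
except TypeError as err:
    print("TypeError: %s" % err)
```

The second pass yields an empty list, and the indexing attempt raises a `TypeError`.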

<h2> Chapter 2: Epitope prediction </h2>
<br/>
In `Fred2` we are dealing with a lot of sequences of different kinds. We take transcript sequences and integrate mutational variants and consider heterozygosities. We generate individualized protein sequences, from which we will slice peptides and calculate their immunological properties. And we are only interested in a specific set of them, say the predicted binders to a certain MHC molecule.
This is an ideal use case for Python generators, as we
  * can generate our objects on-the-go,
  * are only interested in a few and 
  * have to deal with a whole lot of them and want to preserve memory.

So, for a concrete example: if we are only interested in polymorphic peptides, the high number of heterozygous variants prohibits the construction of all polymorphic transcripts at once. To make matters worse, we do not want to keep track of all resulting proteins, only of those producing immunologically interesting peptides.
The combinatorial explosion prohibits such brute-force approaches even on medium-sized sets of sequences.
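
To get a feeling for the numbers, here is a toy illustration of the counting argument. It assumes that every heterozygous variant can independently be present or absent, which gives 2^n combinations for n variants. The function `variant_combinations` and the string "variants" are hypothetical and do not reflect how `Fred2` enumerates combinations internally:

```python
from itertools import product

def variant_combinations(variants):
    """Lazily yield every present/absent combination of the given variants."""
    for mask in product((False, True), repeat=len(variants)):
        # Keep exactly those variants whose flag in the mask is True.
        yield tuple(v for v, keep in zip(variants, mask) if keep)

toy_variants = ["v1", "v2", "v3"]
n_combos = sum(1 for _ in variant_combinations(toy_variants))
print(n_combos)   # 2 ** 3 combinations for three variants
print(2 ** 30)    # already more than a billion for thirty variants
```

Because the combinations are yielded one by one, counting them never requires holding more than one in memory.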

In Fred2, we therefore have prepared a Python generator solution for the `Fred2` objects of <a href="http://fred2.readthedocs.org/en/latest/Fred2.Core.html#module-Fred2.Core.Variant">variants</a>, <a href="http://fred2.readthedocs.org/en/latest/Fred2.Core.html#module-Fred2.Core.Transcript">transcripts</a>, <a href="http://fred2.readthedocs.org/en/latest/Fred2.Core.html#module-Fred2.Core.Protein">proteins</a> and <a href="http://fred2.readthedocs.org/en/latest/Fred2.Core.html#module-Fred2.Core.Peptide">peptides</a> in the `Fred2.Core` module. The remainder of this chapter introduces the usage of the basic `Fred2` generators.

We will start with the **`generate_transcripts_from_variants`** function. To show how the generator works on a small number of variants, we will use an excerpt of an ANNOVAR output file.

In [3]:
vars = read_annovar_exonic("data/annovar_excerpt.out")
print vars

[Variant(g.67705958G>A), Variant(g.234183368A>G), Variant(g.20763686G>-)]


The generator takes a list of variants, each of which must carry some form of sequence identifier in its **`coding`** field, designating the sequence on which it denotes a variation. It also takes a <a href="http://fred2.readthedocs.org/en/latest/Fred2.Core.html#module-Fred2.IO.DBAdapter">DBAdapter</a> to retrieve the sequences for these identifiers and an **`EIdentifierTypes`** value to indicate the type of identifier for the **`DBAdapter`**.

In [8]:
print vars[0].coding  # show a variant's coding
mart = MartsAdapter(biomart="http://www.ensembl.org")
trans = generate_transcripts_from_variants(vars, mart, EIdentifierTypes.REFSEQ)

{'NM_144701': <Fred2.Core.Variant.MutationSyntax instance at 0x7f42caebea70>}


Now we can simply iterate over our transcripts as they are created on-the-go.
       

In [9]:
for t in trans:
    print t.transcript_id, len(t)

NM_004004:FRED2_0 681
NM_004004:FRED2_1 680
NM_001190267:FRED2_0 1824
NM_001190267:FRED2_1 1824
NM_144701:FRED2_0 1890
NM_144701:FRED2_1 1890
NM_001190266:FRED2_0 1824
NM_001190266:FRED2_1 1824
NM_198890:FRED2_0 1335
NM_198890:FRED2_1 1335
NM_030803:FRED2_0 1824
NM_030803:FRED2_1 1824
NM_017974:FRED2_0 1767
NM_017974:FRED2_1 1767


While iterating over the generator, we print each generated transcript's ID and sequence length. As you can see, the generator creates heterozygous results (two versions, `FRED2_0` and `FRED2_1`, of each transcript). This is because our variants are registered as heterozygous!

In [10]:
vars[0].isHomozygous

False

We have already iterated over our transcript iterator, so if we want to use it again, we have to reinitialize it. Then, we can use it in combination with the next generator, the **`generate_proteins_from_transcripts`** generator. This one will need nothing more than a list of transcripts (or a generator thereof).

In [12]:
prots = generate_proteins_from_transcripts(generate_transcripts_from_variants(vars, mart, EIdentifierTypes.REFSEQ))
for p in prots:
    print p.transcript_id, len(p)

NM_004004:FRED2_0 226
NM_004004:FRED2_1 12
NM_001190267:FRED2_0 607
NM_001190267:FRED2_1 607
NM_144701:FRED2_0 629
NM_144701:FRED2_1 629
NM_001190266:FRED2_0 607
NM_001190266:FRED2_1 607
NM_198890:FRED2_0 444
NM_198890:FRED2_1 444
NM_030803:FRED2_0 607
NM_030803:FRED2_1 607
NM_017974:FRED2_0 588
NM_017974:FRED2_1 588
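
Nesting one generator inside another, as in the cell above, builds a lazy pipeline: nothing is computed until the outermost generator is iterated, and each item is pulled through all stages one at a time. The following sketch shows this with two hypothetical stand-in generators (`toy_transcripts` and `toy_proteins` are not `Fred2` functions; the code should run under both Python 2 and Python 3):

```python
events = []  # records when work actually happens

def toy_transcripts(ids):
    """Hypothetical stand-in for a transcript generator."""
    for tid in ids:
        events.append("transcript " + tid)
        yield tid

def toy_proteins(transcripts):
    """Hypothetical stand-in that consumes any iterable of transcripts."""
    for t in transcripts:
        events.append("protein " + t)
        yield t + ":protein"

pipeline = toy_proteins(toy_transcripts(["T1", "T2"]))
events_before = list(events)       # still empty: building the pipeline does no work
first = next(pipeline)             # pulls exactly one item through both stages
events_after_first = list(events)
rest = list(pipeline)              # consumes the remaining items

print(events_before)
print(first)
print(events_after_first)
print(rest)

# A plain list works as input just as well as a generator.
from_list = list(toy_proteins(["T3"]))
print(from_list)
```

After the first `next` call, only the work for `T1` has been recorded; `T2` is processed only when the rest of the pipeline is consumed.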


Once we have created our proteins, we can slice peptides with the **`generate_peptides_from_proteins`** generator. For this, we additionally have to specify the length of the peptides we want to generate. Here we choose 9-mers.

In [13]:
peps = generate_peptides_from_proteins(
    generate_proteins_from_transcripts(
        generate_transcripts_from_variants(vars, mart, EIdentifierTypes.REFSEQ)),
    9)
    
ps  = set()
count = 0
for pp in peps:
    ps.add(pp)
    count += 1

print count,len(ps)

1491 1491


The generator creates each peptide only once, even if the peptide's sequence occurs more than once. For each occurrence, the originating proteins and transcripts are tracked.

In [14]:
list(ps)[-1]

PEPTIDE:
 QNILESHFN
in TRANSCRIPT: NM_144701:FRED2_0
	VARIANTS:
in TRANSCRIPT: NM_144701:FRED2_1
	VARIANTS:
in PROTEIN: NM_144701:FRED2_0
in PROTEIN: NM_144701:FRED2_1
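
The bookkeeping visible in the output above (one peptide, several origins) can be imitated with a sliding window and a dictionary. This is only a sketch under simplifying assumptions: `unique_kmers`, the toy sequences and the `(protein_id, start)` origin tuples are hypothetical, and the actual `Fred2` implementation may work differently.

```python
def unique_kmers(proteins, k):
    """Yield each distinct k-mer once, together with all of its origins.

    proteins is a list of (protein_id, sequence) pairs.
    """
    origins = {}   # k-mer -> list of (protein_id, start) pairs
    order = []     # remembers the order of first occurrence
    for prot_id, seq in proteins:
        # Slide a window of length k over the sequence.
        for start in range(len(seq) - k + 1):
            kmer = seq[start:start + k]
            if kmer not in origins:
                origins[kmer] = []
                order.append(kmer)
            origins[kmer].append((prot_id, start))
    for kmer in order:
        yield kmer, origins[kmer]

toy_proteins = [("P1", "MKTAYIAK"), ("P2", "KTAYIAKQ")]
result = dict(unique_kmers(toy_proteins, 5))
print(len(result))
print(result["KTAYI"])
```

The two toy sequences contain eight 5-mer windows in total but only five distinct 5-mers; the shared ones list both proteins as origins.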

Now we have prepared an input suited for further analysis like <a href="EpitopePrediction.ipynb">EpitopePrediction</a>.