# Lecture 7 - Functional gene annotation

In this lecture you have learned about finding and describing the function of genes and proteins.

In lecture 3, we used [Prokka](https://academic.oup.com/bioinformatics/article/30/14/2068/2390517) to identify all the genes (open reading frames) present in an assembled genome. In addition to identifying gene sequences, Prokka also annotates them by homology-based annotation transfer, searching each sequence against reference databases (UniProt, RefSeq, Pfam).

Let's load the annotation file (in *GenBank* format) generated by Prokka. This file format is more detailed than a simple FASTA file, and contains several annotated **features** for each gene sequence:

In [None]:
from Bio import SeqIO

sequences = list(SeqIO.parse('../lecture_03/files/output/annotated.gbk', 'genbank'))

print(sequences[0])
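
To get a feel for what a single feature looks like before digging into the real file, here is a minimal sketch that builds a toy record by hand. The sequence, the `toy_contig` id and the qualifier values are all made up for illustration; only the Biopython classes are real. The qualifiers are stored as lists of strings, mirroring how the GenBank parser stores them.

```python
from Bio.Seq import Seq
from Bio.SeqFeature import FeatureLocation, SeqFeature
from Bio.SeqRecord import SeqRecord

# Hypothetical record: a 9 bp "contig" holding one coding sequence.
record = SeqRecord(Seq("ATGAAATAA"), id="toy_contig")

# A feature has a type, a location and a dictionary of qualifiers.
# Each qualifier maps to a list of strings (values invented here).
cds = SeqFeature(
    location=FeatureLocation(0, 9, strand=1),
    type="CDS",
    qualifiers={
        "locus_tag": ["TOY_00001"],
        "product": ["hypothetical protein"],
    },
)
record.features.append(cds)

# Walk over the features the same way you would for a parsed file.
for feature in record.features:
    print(feature.type, feature.location)
    for key, values in feature.qualifiers.items():
        print(f"  {key}: {values}")
```

The same loop should work on any record in `sequences`, since parsed records expose the same `.features` and `.qualifiers` attributes.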

-------

## Exercise 1:

Among the features that Prokka can annotate are EC numbers (unfortunately it does not annotate GO terms). 

Create a list (or a set, if you want to avoid repetitions) of all EC numbers annotated in this genome assembly.

> How many (unique) EC numbers did you find?

In [None]:
# type your code here...

Click the cell below to see the solution:

In [None]:
ec_numbers = set()

for seq in sequences:
    for feature in seq.features:
        if 'EC_number' in feature.qualifiers:
            ec_numbers.update(feature.qualifiers['EC_number'])
            
print(len(ec_numbers))

---------

## Gene Ontology

[GOATOOLS](https://www.nature.com/articles/s41598-018-28948-z) is a useful Python library for working with the Gene Ontology.

Let's begin by installing it...

In [None]:
!pip install goatools

Now let's load the latest (well, it was when I wrote this) version of the complete ontology from a local file:

In [None]:
from goatools import obo_parser
go_terms = obo_parser.GODag('files/go-basic.obo', optional_attrs='xref')

In [None]:
go_terms['GO:0000001']

## Exercise 2.1:

You might have noticed that the list of EC numbers detected in the genome includes our good old friend **2.7.1.11** (*PFK1*, obviously). Can you find the respective GO term?

> Tip: `go_terms` is a dictionary that maps GO term ids to GO term objects. The EC number of a term (if it has one) is stored in its `xref` attribute, prefixed with `EC:`.

In [None]:
# type your code here...

Click below to see the solution:

In [None]:
# Scan every term until one cross-references the EC number of PFK1
for go_id, go_term in go_terms.items():
    if 'EC:2.7.1.11' in go_term.xref:
        print(go_term)
        break  # stop at the first match

## Exercise 2.2:

Each GO term has a `.parents` attribute with a list of its (immediate) parents. Can you write code that prints a list (including id and name) of all its ancestors?

> Tip: this looks like a job for a recursive function (*i.e.* a function that calls itself).

In [None]:
# type your code here...

Click below to see the solution:

In [None]:
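def get_parents(go_term):
    """Print every ancestor of go_term, one child -> parent edge per line."""
    for parent in go_term.parents:
        print(f"{go_term.id} -> {parent.id}: {parent.name}")
        # Recurse to follow the parents of this parent
        get_parents(parent)

go = go_terms['GO:0003872']
get_parents(go)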

Inspect the tree of your term in [QuickGO](https://www.ebi.ac.uk/QuickGO/) to confirm that you got all the correct terms.
- Did you get a similar tree?
- What are the differences?
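
One difference you may run into: the Gene Ontology is a directed acyclic graph rather than a strict tree, so a term can be reachable through more than one path, and a plain recursive walk may then visit the same ancestor several times. The sketch below shows one way to collect each ancestor only once. It uses a hypothetical `ToyTerm` class and made-up ids (`T:1` to `T:4`) instead of the real ontology, so that it runs on its own.

```python
class ToyTerm:
    """Hypothetical stand-in for a GO term: an id, a name and its parents."""

    def __init__(self, term_id, name, parents=()):
        self.id = term_id
        self.name = name
        self.parents = list(parents)


def collect_ancestors(term, seen=None):
    """Return a dict {id: name} holding every ancestor of `term` exactly once."""
    if seen is None:
        seen = {}
    for parent in term.parents:
        # Only recurse into parents that have not been visited yet
        if parent.id not in seen:
            seen[parent.id] = parent.name
            collect_ancestors(parent, seen)
    return seen


# Diamond-shaped toy graph: the leaf reaches the root through two paths.
root = ToyTerm("T:1", "root")
left = ToyTerm("T:2", "left branch", [root])
right = ToyTerm("T:3", "right branch", [root])
leaf = ToyTerm("T:4", "leaf", [left, right])

ancestors = collect_ancestors(leaf)
for term_id, name in sorted(ancestors.items()):
    print(f"{term_id}: {name}")
```

Because the function only relies on the `.id`, `.name` and `.parents` attributes, it should also work when called on `go_terms['GO:0003872']`.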