# Lecture 7 - Functional gene annotation

In this lecture you have learned about finding and describing the function of genes and proteins.

In lecture 3, we used [Prokka](https://academic.oup.com/bioinformatics/article/30/14/2068/2390517) to identify all the genes (open reading frames) present in an assembled genome. In addition to identifying gene sequences, Prokka also annotates them by homology-based annotation transfer: it searches each sequence against reference databases (BLAST against protein databases such as UniProt and RefSeq, and HMM profile searches against databases such as Pfam) and copies the annotation of the best match.
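To make the idea of annotation transfer concrete, here is a minimal sketch in plain Python. It is not Prokka's actual implementation: the hits, the reference products, and the e-value cutoff are all made-up assumptions, and `transfer_annotations` is a hypothetical helper written for this illustration only.

```python
# Toy search hits: (query gene, reference protein, e-value).
# All identifiers and numbers are invented for illustration.
hits = [
    ("gene_1", "ref_A", 1e-50),
    ("gene_1", "ref_B", 1e-10),
    ("gene_2", "ref_C", 0.5),
]

# Hypothetical reference annotations, as a curated database would provide.
reference_products = {
    "ref_A": "ATP-dependent 6-phosphofructokinase",
    "ref_B": "carbohydrate kinase",
    "ref_C": "transcriptional regulator",
}

EVALUE_CUTOFF = 1e-6  # assumed threshold for a "significant" hit


def transfer_annotations(hits, reference_products, cutoff=EVALUE_CUTOFF):
    """Give each query the product of its best (lowest e-value) significant hit."""
    best = {}
    for query, subject, evalue in hits:
        if evalue > cutoff:
            continue  # hit is too weak to trust
        if query not in best or evalue < best[query][1]:
            best[query] = (subject, evalue)
    return {query: reference_products[subject] for query, (subject, _) in best.items()}


annotations = transfer_annotations(hits, reference_products)
for gene in ("gene_1", "gene_2"):
    # Genes without a significant hit keep a generic label.
    print(gene, "->", annotations.get(gene, "hypothetical protein"))
```

Here `gene_2` has no hit below the cutoff, so nothing is transferred and it keeps the generic "hypothetical protein" label.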

Let's load the annotations file (in *GenBank* format) generated by Prokka. This file format is more detailed than a simple FASTA file, and contains several annotated **features** (genes and their respective functions) for each contig in the original FASTA file:

In [None]:
from Bio import SeqIO

contigs = list(SeqIO.parse('files/annotated.gbk', 'genbank'))

We can print the first annotated contig for an overview of the number of annotated genes (features):

In [None]:
print(contigs[0])

And these are the first features of the first contig (the first feature simply covers the whole contig sequence):

In [None]:
for feature in contigs[0].features[:3]:
    print(feature)

Let's find a gene that has been annotated with an EC number:

In [None]:
for feature in contigs[0].features:
    if 'EC_number' in feature.qualifiers:
        print(feature)
        break


-------

## Exercise 1:

Among the features that Prokka can annotate are EC numbers (unfortunately it does not annotate GO terms). 

Create a list (or a set, if you want to avoid repetitions) of all EC numbers annotated in this genome assembly.

> How many (unique) EC numbers did you find?

In [None]:
# type your code here...

Click the cell below to see the solution:

In [None]:
ec_numbers = set()

for contig in contigs:
    for feature in contig.features:
        if 'EC_number' in feature.qualifiers:
            ec_numbers.update(feature.qualifiers['EC_number'])
            
print('number of unique EC numbers:', len(ec_numbers))
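Once you have the set, a natural follow-up is to summarise it by top-level enzyme class, which is given by the first digit of the EC number. The sketch below assumes a small hand-picked set of EC numbers (not the actual Prokka output); `count_ec_classes` is a hypothetical helper, and you could pass it your own `ec_numbers` set instead.

```python
from collections import Counter

# The first digit of an EC number identifies the top-level enzyme class.
EC_CLASSES = {
    "1": "oxidoreductases",
    "2": "transferases",
    "3": "hydrolases",
    "4": "lyases",
    "5": "isomerases",
    "6": "ligases",
    "7": "translocases",
}


def count_ec_classes(ec_numbers):
    """Count how many EC numbers fall into each top-level class."""
    counts = Counter()
    for ec in ec_numbers:
        first_digit = ec.split(".")[0]
        counts[EC_CLASSES.get(first_digit, "unknown")] += 1
    return counts


# Hand-picked example set, standing in for the set built in the exercise.
example_ecs = {"2.7.1.11", "2.7.1.40", "1.1.1.27", "4.2.1.11"}
class_counts = count_ec_classes(example_ecs)

for name, n in class_counts.most_common():
    print(f"{name}: {n}")
```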

---------

## Gene Ontology

[GOATOOLS](https://www.nature.com/articles/s41598-018-28948-z) is a useful Python library to work with the Gene Ontology.

Let's begin by installing it (for example with `pip install goatools`, if it is not already available in your environment)...

Now let's load the latest (well, it was when I wrote this) version of the complete ontology from a local file:

In [1]:
from goatools import obo_parser
go_terms = obo_parser.GODag('files/go-basic.obo', optional_attrs='xref')

files/go-basic.obo: fmt(1.2) rel(2021-10-26) 47,197 Terms; optional_attrs(xref)


> **Note**: if the code above fails, try restarting the kernel.

Let's inspect the first term...

In [2]:
go_terms['GO:0000001']

GOTerm('GO:0000001'):
  id:GO:0000001
  item_id:GO:0000001
  name:mitochondrion inheritance
  namespace:biological_process
  _parents: 2 items
    GO:0048311
    GO:0048308
  parents: 2 items
    GO:0048308	level-05	depth-05	organelle inheritance [biological_process]
    GO:0048311	level-05	depth-06	mitochondrion distribution [biological_process]
  children: 0 items
  level:6
  depth:7
  is_obsolete:False
  alt_ids: 0 items
  xref: 0 items

## Exercise 2.1:

In the genome annotations we obtained EC numbers but no GO terms. We can use the GO library to find which GO terms are associated with those EC numbers.

Let's search for our good old friend **2.7.1.11** (*PFK1*, obviously). Can you find the respective GO term?

> Tip: `go_terms` is a dictionary mapping GO term ids to GO term objects. Each term's `xref` attribute holds its cross-references; the EC number (if there is one) is the entry that starts with `EC:`.

In [None]:
# type your code here...

Click below to see the solution:

In [None]:
for go_id, go_term in go_terms.items():
    if 'EC:2.7.1.11' in go_term.xref:
        print(go_term)
        break
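If you wanted to look up many EC numbers (for instance all those found in Exercise 1), scanning the whole ontology once per EC number would be wasteful. One option is to build a reverse index from EC number to GO ids in a single pass. The sketch below uses simple stand-in objects instead of real GO terms so that it runs on its own; `build_ec_index` is a hypothetical helper, and the non-EC cross-reference is a placeholder. With the real ontology you would pass `go_terms` instead of `toy_terms`.

```python
from collections import defaultdict
from types import SimpleNamespace

# Stand-ins for GO term objects: only the `xref` attribute matters here.
toy_terms = {
    "GO:0003872": SimpleNamespace(
        name="6-phosphofructokinase activity",
        xref={"EC:2.7.1.11", "DB:placeholder"},  # second entry is a made-up non-EC xref
    ),
    "GO:0004743": SimpleNamespace(name="pyruvate kinase activity", xref={"EC:2.7.1.40"}),
    "GO:0000001": SimpleNamespace(name="mitochondrion inheritance", xref=set()),
}


def build_ec_index(terms):
    """Map each EC number to the set of GO ids that cross-reference it."""
    index = defaultdict(set)
    for go_id, term in terms.items():
        for ref in term.xref:
            if ref.startswith("EC:"):
                index[ref[len("EC:"):]].add(go_id)
    return dict(index)


ec_to_go = build_ec_index(toy_terms)
print(sorted(ec_to_go["2.7.1.11"]))
```

A set is used for the values because nothing guarantees that an EC number is cross-referenced by only one GO term.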

## Exercise 2.2:

Each GO term contains an attribute `.parents` with its (immediate) parents. Can you create a loop that prints a list (including id and name) of all the ancestors of the GO term you just found?

> Tip: this looks like a job for a recursive function (*i.e.* a function that calls itself).

In [None]:
# type your code here...

Click below to see the solution:

In [None]:
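def get_parents(go_term, indent=0):
    """Recursively print every ancestor of go_term, indented by distance."""
    for parent in go_term.parents:
        print("\t" * indent, f"{go_term.id} -> {parent.id}: {parent.name}")
        get_parents(parent, indent + 1)

# GO:0003872 is the term found in Exercise 2.1
go = go_terms['GO:0003872']
get_parents(go)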
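Because the ontology is a graph rather than a simple tree, the same ancestor can be reached through more than one path, and a recursive print will then list it once per path. If you only want each ancestor once, you can collect them while recursing. Here is a minimal sketch on a toy "diamond" hierarchy; `ToyTerm` and `collect_ancestors` are hypothetical names, and the toy objects only mimic the `id`, `name` and `parents` attributes used above.

```python
class ToyTerm:
    """Minimal stand-in for a GO term with id, name and parents."""

    def __init__(self, term_id, name):
        self.id = term_id
        self.name = name
        self.parents = set()


def collect_ancestors(term, seen=None):
    """Return a dict {id: name} with every ancestor of term, each listed once."""
    if seen is None:
        seen = {}
    for parent in term.parents:
        if parent.id not in seen:  # skip ancestors already visited via another path
            seen[parent.id] = parent.name
            collect_ancestors(parent, seen)
    return seen


# Diamond-shaped toy hierarchy: leaf -> left -> root and leaf -> right -> root
root = ToyTerm("T:A", "root")
left = ToyTerm("T:B", "left")
right = ToyTerm("T:C", "right")
leaf = ToyTerm("T:D", "leaf")
left.parents.add(root)
right.parents.add(root)
leaf.parents.update({left, right})

ancestors = collect_ancestors(leaf)
for term_id in sorted(ancestors):
    print(term_id, ancestors[term_id])
```

The `root` term is reachable through both `left` and `right`, but it appears only once in the result. The same function should work on a real term such as `go_terms['GO:0003872']`, assuming its parents expose the same three attributes.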

Inspect the tree of your term in [QuickGO](https://www.ebi.ac.uk/QuickGO/) to confirm that you got all the correct terms. 
- Did you get a similar tree? 
- What are the differences?