# Lecture 7 - Functional gene annotation

In this lecture you learned about annotating gene function using controlled vocabularies like **EC numbers** and **GO terms**.

### Learning objectives:

- Practice basic programming skills (parsing, filtering, ...)
- Use recursive functions


## Exercise 1:

In lecture 3 (exercise 1, option 2), we used [**Prokka**](https://academic.oup.com/bioinformatics/article/30/14/2068/2390517) to identify all the genes (open reading frames) present in an assembled genome. 

In addition to identifying gene sequences, Prokka also annotates them (using homology-based annotation transfer) by searching them against reference databases (UniProt, RefSeq, Pfam). You might have noticed that, along with the FASTA file containing the detected ORFs, it also generated an annotation file in [**GenBank**](https://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html) format.
This file format is more detailed than a simple FASTA file and contains several annotated **features** (genes and their respective functions) for each contig in the original FASTA file.

Let's start by loading that file: 

In [None]:
from Bio import SeqIO

contigs = list(SeqIO.parse('files/annotated.gbk', 'genbank'))

We can print the first annotated contig for an overview of the number of annotated features:

In [None]:
print(contigs[0])

> Remember: the genome we annotated in Lecture 3 was not a fully assembled genome; it was the best possible assembly of the raw sequencing data from Lecture 2.
> That assembly was a FASTA file with multiple contigs (sorted by decreasing size), so the annotated features in this GenBank file are grouped by contig.

Let's check how many features are present in each contig:

In [None]:
feats_per_contig = [len(contig.features) for contig in contigs]
print(feats_per_contig)

**Fundamental law** of the universe: It's always best to plot your data!

In [None]:
import matplotlib.pyplot as plt

plt.hist(feats_per_contig)
plt.xlabel('number of features')
plt.ylabel('number of contigs')

Here is how we could look at the first 5 features of the 3rd contig: 

> Note: the first feature (*source*) is actually just the original contig.

In [None]:
for feature in contigs[2].features[:5]:
    print(feature)

### 1.1

Each feature is represented as a [**SeqFeature**](https://biopython.org/docs/1.75/api/Bio.SeqFeature.html) object.

- Take a look at the documentation of [**SeqFeature**](https://biopython.org/docs/1.75/api/Bio.SeqFeature.html)
- Make a loop that iterates over the features of the first contig
- Print the first **EC number** you find (and stop). 

In [None]:
# type your code here

Click below to see a solution:

In [None]:

for feature in contigs[0].features:
    if 'EC_number' in feature.qualifiers:
        # qualifier values are lists of strings, so take the first entry
        print(feature.qualifiers['EC_number'][0])
        break
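
A side note on qualifiers: when Biopython parses a GenBank file, each qualifier value is a list of strings (a feature can carry the same qualifier more than once), and not every feature has every qualifier. The sketch below uses a plain dictionary with made-up values to mimic that shape. The name `toy_qualifiers` and the helper `first_value` are hypothetical and only serve as illustration.

```python
# Hypothetical qualifiers for one CDS feature, written as a plain dict that
# mimics the shape of `feature.qualifiers` (keys -> lists of strings).
toy_qualifiers = {
    'locus_tag': ['TOY_00001'],
    'product': ['hypothetical kinase'],
    'EC_number': ['2.7.1.11'],
}


def first_value(qualifiers, key, default=None):
    """Return the first value stored under `key`, or `default` if absent or empty."""
    values = qualifiers.get(key, [])
    return values[0] if values else default


# An existing qualifier: we get the string, not the surrounding list.
print(first_value(toy_qualifiers, 'EC_number'))

# A missing qualifier: we get the fallback instead of a KeyError.
print(first_value(toy_qualifiers, 'gene', 'unnamed'))
```

With real data you would pass `feature.qualifiers` instead of the toy dictionary.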


### 1.2

Let's analyse the functional potential of the organism (in terms of metabolic function diversity):

- Create a list (or set) of unique (i.e. not repeated) EC numbers across all contigs
- Print the total number of EC numbers in each class

Reminder, these are the top classes:

- EC 1 - Oxidoreductases
- EC 2 - Transferases
- EC 3 - Hydrolases
- EC 4 - Lyases
- EC 5 - Isomerases
- EC 6 - Ligases
- EC 7 - Translocases

In [None]:
# type your code here...

Click the cell below to see the solution:

In [None]:

ec_numbers = set()

# collect every EC number annotated in any feature of any contig
for contig in contigs:
    for feature in contig.features:
        if 'EC_number' in feature.qualifiers:
            ec_numbers.update(feature.qualifiers['EC_number'])

# the first digit of an EC number identifies its top-level class
for i in range(1, 8):
    ecs_i = [x for x in ec_numbers if x.split('.')[0] == str(i)]
    print(f'EC {i}: {len(ecs_i)}')
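
An alternative sketch of the same counting step, using `collections.Counter` so the set of EC numbers is traversed only once. The EC numbers in `toy_ecs` are made up for illustration (including a partial one ending in `-`); with real data you would use the `ec_numbers` set instead. The names `toy_ecs`, `EC_CLASSES` and `class_counts` are assumptions of this example.

```python
from collections import Counter

# Hypothetical EC numbers standing in for the set collected from the contigs.
toy_ecs = {'1.1.1.1', '2.7.1.11', '2.7.1.40', '3.1.3.11', '4.2.1.-'}

# Top-level EC classes, as listed in the exercise.
EC_CLASSES = {
    '1': 'Oxidoreductases',
    '2': 'Transferases',
    '3': 'Hydrolases',
    '4': 'Lyases',
    '5': 'Isomerases',
    '6': 'Ligases',
    '7': 'Translocases',
}

# Count EC numbers per class, keyed by the first digit (as a string).
class_counts = Counter(ec.split('.')[0] for ec in toy_ecs)

# A Counter returns 0 for classes that never occurred, so no special case is needed.
for digit, name in EC_CLASSES.items():
    print(f'EC {digit} - {name}: {class_counts[digit]}')
```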

---------

## Exercise 2 - Gene Ontology

[**GOATOOLS**](https://www.nature.com/articles/s41598-018-28948-z) is a useful Python library to work with the [Gene Ontology](https://geneontology.org/).

Let's start by loading the latest version (well... at least it was when I wrote this) of the complete ontology from a local file (downloaded from [here](https://geneontology.org/docs/download-ontology/)):

In [None]:
from goatools import obo_parser
go_terms = obo_parser.GODag('files/go-basic.obo', optional_attrs='xref')

Let's inspect the first term...

In [None]:
go_terms['GO:0000001']

### 2.1

In the genome annotations we obtained EC numbers but not GO terms (unfortunately, Prokka does not annotate with GO terms). However, we can use the GO library to find which GO terms are associated with those EC numbers.

Let's search for our good old friend **2.7.1.11** (*PFK1*, obviously). Can you find the GO term that corresponds to this enzyme?

> **Tip**: `go_terms` is a dictionary that maps GO term ids to GO term objects. The EC number of a term (if it has one) is stored in its `xref` attribute, prefixed with `EC:`.

In [None]:
# type your code here...

Click below to see the solution:

In [None]:

for go_id, go_term in go_terms.items():
    if 'EC:2.7.1.11' in go_term.xref:
        print(go_term)
        break
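
If you want to look up many EC numbers (for example, all the ones found in the genome), scanning the whole ontology once per EC number gets slow. A possible approach is to build a reverse index once and then query it. The sketch below uses toy objects with made-up ids so it runs without the ontology file. `toy_terms`, `build_ec_index` and `ec_to_go` are hypothetical names, and the toy objects only mimic the `id` and `xref` attributes used in the solution above.

```python
from collections import defaultdict
from types import SimpleNamespace

# Toy stand-ins for GO term objects (made-up ids and cross-references).
toy_terms = {
    'TOY:0000001': SimpleNamespace(id='TOY:0000001', xref={'EC:2.7.1.11', 'OTHERDB:abc'}),
    'TOY:0000002': SimpleNamespace(id='TOY:0000002', xref={'EC:1.1.1.1'}),
    'TOY:0000003': SimpleNamespace(id='TOY:0000003', xref=set()),
}


def build_ec_index(terms):
    """Map each EC number (without the 'EC:' prefix) to a set of term ids."""
    index = defaultdict(set)
    for term in terms.values():
        # getattr guards against terms that carry no xref attribute at all
        for ref in getattr(term, 'xref', set()):
            if ref.startswith('EC:'):
                index[ref[len('EC:'):]].add(term.id)
    return index


ec_to_go = build_ec_index(toy_terms)
print(sorted(ec_to_go['2.7.1.11']))
```

With the real ontology you would call `build_ec_index(go_terms)`. Using `term.id` as the stored value means a term is recorded once even if the dictionary reaches it through more than one key.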

### 2.2

Remember that GO terms are organized in a hierarchy. Each GO term contains the attributes `.parents` and `.children` with a *set* of GO terms that are (immediately) above or below. 

Create a loop that prints a list (including id and name) of all the ancestors (parents, grandparents, ...) of the GO term you just found.

> Tip: this looks like a job for a recursive function (*i.e.* a function that calls itself).

In [None]:
# type your code here...

Click below to see the solution:

In [None]:

def get_parents(go_term, indent=0):
    # indent is just a visual aid, you can ignore that
    spacing = ' '*14*indent
    
    for parent in go_term.parents:
        print(f"{spacing}{go_term.id} -> {parent.id}: {parent.name}")
        get_parents(parent, indent+1)

go = go_terms['GO:0003872']
get_parents(go)
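
The recursive printing above can show the same ancestor more than once when it is reachable through several paths. If you only need the unique ancestors, one option is to collect them in a dictionary while recursing and skip terms you have already visited. This is a minimal sketch on a toy hierarchy: `ToyTerm` and `collect_ancestors` are hypothetical names, and `ToyTerm` only mimics the `id`, `name` and `parents` attributes of a GO term.

```python
class ToyTerm:
    """Minimal stand-in for a GO term: an id, a name and a set of parents."""

    def __init__(self, id, name):
        self.id = id
        self.name = name
        self.parents = set()


def collect_ancestors(term, seen=None):
    """Return a dict {ancestor id: ancestor name} for all ancestors of `term`."""
    if seen is None:
        seen = {}
    for parent in term.parents:
        if parent.id not in seen:  # visit each ancestor only once
            seen[parent.id] = parent.name
            collect_ancestors(parent, seen)
    return seen


# Toy diamond-shaped hierarchy: the leaf reaches the root through two branches.
root = ToyTerm('T:0', 'root')
branch_a = ToyTerm('T:1', 'branch a')
branch_b = ToyTerm('T:2', 'branch b')
leaf = ToyTerm('T:3', 'leaf')
branch_a.parents = {root}
branch_b.parents = {root}
leaf.parents = {branch_a, branch_b}

ancestors = collect_ancestors(leaf)
for term_id in sorted(ancestors):
    print(term_id, ancestors[term_id])
```

Here the root is reported once, even though two paths lead to it.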

### 2.3

Inspect the tree of your term in [QuickGO](https://www.ebi.ac.uk/QuickGO/) to confirm that you got all the correct terms. 

- 🤔 Did you get a similar tree?
- 🧠 What is the main difference?

## Wrap-up

This session was mostly about practicing some Python skills. If you got stuck on an exercise, just ask for help. Or, if you found the exercises too simple, maybe *you* can help someone. 😉