# Lecture 7 - Functional gene annotation

In this lecture you learned about annotating gene function using controlled vocabularies like **EC numbers** and **GO terms**.

### Learning objectives:

- Learn to functionally annotate genes with InterProScan
- Become familiar with the Pandas library
- Navigate ontologies with recursive functions

## Exercise 1

In lecture 2 we assembled the genome of [*Mycoplasma pneumoniae*](https://en.wikipedia.org/wiki/Mycoplasma_pneumoniae) from sequencing data downloaded from the European Nucleotide Archive (accession [DRR040043](https://www.ebi.ac.uk/ena/browser/view/DRR040043?show=reads)).

In lecture 3 we used **Prodigal** to find all open reading frames in the assembled genome and translate them to amino acid sequences.

In this exercise, we will functionally annotate those amino acid sequences using [InterProScan](https://www.ebi.ac.uk/interpro/about/interproscan/). This tool scans a query protein sequence against all protein families in InterPro (using HMM profiles as discussed in lecture 5).

The search engine for **InterProScan** works similarly to **BLAST**. However, the [web portal](https://www.ebi.ac.uk/interpro/search/sequence/) only accepts up to 100 sequences at once. Fortunately, there is an instance of **InterProScan** available in **Galaxy** that can annotate all the sequences in a FASTA file:

![galaxy](files/interproscan.png)

* Upload the [proteins.faa](files/proteins.faa) file (you must first download it from *lecture_07/files*)
* Search for *InterProScan* and select the file you just uploaded
* Make sure all databases are selected
* Select **Run Tool**

👉 This will take about 30 minutes to run. In the next exercise we will use a pre-computed result.

## Exercise 2

The annotation result is a TSV (tab-separated values) file that you can very easily load with **Pandas**.

👉 **Pandas** is an extremely powerful library for working with tabular data (including Excel files). I strongly encourage you to spend a few minutes looking at the [documentation](https://pandas.pydata.org/docs/index.html).

Let's start by loading the results:

In [None]:
import pandas as pd

df = pd.read_csv('files/results.tsv', sep='\t', header=None)
df

As you can see there are several columns. According to the [InterProScan documentation](https://interproscan-docs.readthedocs.io/en/v5/OutputFormats.html#tab-separated-values-format-tsv) the output format is as follows:

1. Protein accession 
2. Sequence MD5 digest 
3. Sequence length 
4. Analysis (e.g. Pfam / PRINTS / Gene3D)
5. Signature accession 
6. Signature description 
7. Start location
8. Stop location
9. Score - is the e-value (or score) 
10. Status - is the status of the match 
11. Date - is the date of the run
12. InterPro annotations - accession
13. InterPro annotations - description
14. GO annotations with their source(s)
15. Pathways annotations 

👉 Note that the columns in Pandas are numbered 0 to 14.
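As a small sketch of how the 15 fields map onto a DataFrame, the snippet below parses a single made-up InterProScan-style row from an in-memory string. The column names and the sample row are illustrative assumptions. They are not taken from the documentation or from `results.tsv`.

```python
import io
import pandas as pd

# Hypothetical short names for the 15 fields listed above (chosen for illustration)
columns = [
    'query', 'md5', 'length', 'database', 'signature', 'signature_desc',
    'start', 'stop', 'score', 'status', 'date',
    'interpro', 'interpro_desc', 'GO', 'pathways',
]

# One made-up row in the same tab-separated layout ('-' marks a missing value)
row = '\t'.join([
    'protein_1', 'd41d8cd98f00b204e9800998ecf8427e', '120', 'Pfam', 'PF00000',
    'Example domain', '5', '110', '1.2e-10', 'T', '01-01-2024',
    'IPR000000', 'Example family', 'GO:0005524|GO:0016887', '-',
])

example = pd.read_csv(io.StringIO(row), sep='\t', header=None,
                      names=columns, na_values='-')

# Position 13 (the 14th field) and the name 'GO' refer to the same column
print(example.loc[0, 'GO'])
print(example.iloc[0, 13])
```

Naming the columns up front makes later selections easier to read than positional indices.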

Let's only keep the query protein, target database, and annotated GO terms:

In [None]:
df = pd.read_csv('files/results.tsv', sep='\t', header=None, usecols=[0, 3, 13], na_values='-', names=['query', 'database', 'GO']).dropna()
df.sample(10)

Let's also *"unpack"* the GO terms, so that we only have one term per row (this will replicate the values in the other columns):

In [None]:
df['GO'] = df['GO'].str.split('|')
df = df.explode('GO').drop_duplicates()
df.sample(10)
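To see what the split/explode step does in isolation, here is a minimal sketch on a made-up table (the protein name and the annotations are invented for illustration):

```python
import pandas as pd

# Made-up table: one protein annotated by two databases
toy = pd.DataFrame({
    'query': ['protein_1', 'protein_1'],
    'database': ['Pfam', 'Gene3D'],
    'GO': ['GO:0005524|GO:0016887', 'GO:0005524'],
})

toy['GO'] = toy['GO'].str.split('|')        # each cell becomes a list of terms
toy = toy.explode('GO').drop_duplicates()   # one row per (query, database, term)
print(toy)
```

The two input rows become three: the Pfam row is split in two, and the values of `query` and `database` are repeated for each term.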

We can use `.value_counts()` to find the most frequent values in each column:

In [None]:
df['database'].value_counts()

### 2.1

What is the median number of GO terms annotated per protein?

> 💡 Tip: you can combine the results of multiple operations over pandas dataframes.

In [None]:
# type your code here

Click below to see a solution...

In [None]:

x = df['query'].value_counts().median()
print(f'The answer is: {x:n}') # the :n format prints a whole-number float without the trailing ".0"
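One caveat: `value_counts()` counts rows, and each row here is a (query, database, GO) combination, so a term reported by two databases is counted twice for that protein. If you want distinct terms per protein, a sketch along these lines may be closer. It uses a made-up table so that it runs on its own.

```python
import pandas as pd

# Made-up annotations: p1 has GO:0005524 reported by two different databases
toy = pd.DataFrame({
    'query':    ['p1', 'p1', 'p1', 'p2'],
    'database': ['Pfam', 'Gene3D', 'Pfam', 'Pfam'],
    'GO':       ['GO:0005524', 'GO:0005524', 'GO:0016887', 'GO:0005524'],
})

rows_per_protein = toy['query'].value_counts()            # counts rows
terms_per_protein = toy.groupby('query')['GO'].nunique()  # counts distinct terms

print(rows_per_protein.median(), terms_per_protein.median())
```

Which of the two numbers answers the question depends on whether repeated annotations from different databases should count separately.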

### 2.2

Can you print the top 10 most frequent GO terms?

> 💡 Tip: the `.head()` method can be useful.

In [None]:
# type your code here...

Click below to see a solution...

In [None]:

df['GO'].value_counts().head(10)

---------

## Exercise 3 - Gene Ontology

[**GOATOOLS**](https://www.nature.com/articles/s41598-018-28948-z) is a useful Python library to work with the [Gene Ontology](https://geneontology.org/).

Let's start by loading the latest version (well... at least it was when I wrote this) of the complete ontology from a local file (downloaded from [here](https://geneontology.org/docs/download-ontology/)):

In [None]:
from goatools import obo_parser
go_terms = obo_parser.GODag('files/go-basic.obo', optional_attrs='xref')

We can use this library to get more information about the GO terms annotated in our genome. 

`go_terms` is a dictionary from GO term *"ids"* to GO term *"objects"*. Let's inspect the first term...

In [None]:
go_terms['GO:0000001']

### 3.1

InterProScan annotates with GO terms but not with EC numbers.

Create a function called `go2ec()` that receives a GO term identifier and returns the respective EC number (only if that GO term represents an enzymatic reaction).

> 💡 **Tip**: You can find the respective EC number (if it exists) in the `xref` attribute and it starts with `EC:`.

👉 Note that you might not be able to find some GO terms in the dictionary because they are obsolete. Example: [GO:0045261](https://www.ebi.ac.uk/QuickGO/term/GO:0045261) (just ignore those).

In [None]:
# type your code here...

Click below to see the solution:

In [None]:

def go2ec(go_id):
    if go_id in go_terms:
        go_term = go_terms[go_id]
        for data in go_term.xref:
            if data.startswith('EC:'):
                return data

Let's test your solution:

In [None]:
df['EC'] = df['GO'].apply(go2ec)
df.dropna().sample(10)
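The solution above returns the first `EC:` entry it encounters. A term may list more than one, so a possible variation returns all of them. The sketch below uses a toy dictionary with made-up identifiers and cross-references in place of `go_terms`. The name `go2ecs` is hypothetical, and only the `xref` attribute mentioned in the tip is assumed.

```python
from types import SimpleNamespace

# Stand-in for the GO dictionary: hypothetical ids with made-up xrefs
toy_terms = {
    'toy:1': SimpleNamespace(xref={'EC:1.1.1.1', 'EC:1.1.1.2', 'RHEA:00000'}),
    'toy:2': SimpleNamespace(xref={'RHEA:11111'}),
}

def go2ecs(go_id, terms):
    """Return a sorted list of all EC cross-references for an id (empty if none)."""
    term = terms.get(go_id)
    if term is None:  # unknown or obsolete id
        return []
    return sorted(x for x in term.xref if x.startswith('EC:'))

print(go2ecs('toy:1', toy_terms))
print(go2ecs('toy:2', toy_terms))
print(go2ecs('toy:missing', toy_terms))
```

Sorting the result keeps the output stable from run to run, which is handy when comparing annotations.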

### 3.2

Remember that GO terms are organized in a hierarchy. Each GO term contains the attributes `.parents` and `.children` with a *set* of GO terms that are (immediately) above or below. 

Write code that prints a list (including id and name) of all the ancestors (parents, grandparents, ...) of the most frequent GO term that you found in exercise 2.2.

> 💡 **Tip**: this looks like a job for a recursive function (*i.e.* a function that invokes itself).

In [None]:
# type your code here...

Click below to see the solution:

In [None]:

def get_parents(go_term, indent=0):
    # indent is just a visual aid, you can ignore that
    spacing = ' '*14*indent
    
    for parent in go_term.parents:
        print(f"{spacing}{go_term.id} -> {parent.id}: {parent.name}")
        get_parents(parent, indent+1)

go = go_terms['GO:0005524']
get_parents(go)
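Since a term can be reached through several paths, the printing version above may list the same ancestor more than once. A possible variation collects each ancestor only once. The `Term` class below is a toy stand-in exposing only the `.id`, `.name` and `.parents` attributes used in this exercise. Its ids and names are made up, and `get_ancestors` is a hypothetical helper.

```python
class Term:
    """Toy stand-in with the attribute names used in this exercise."""
    def __init__(self, id, name, parents=()):
        self.id = id
        self.name = name
        self.parents = set(parents)

def get_ancestors(term, seen=None):
    """Recursively collect every ancestor of `term` once, as {id: name}."""
    if seen is None:
        seen = {}
    for parent in term.parents:
        if parent.id not in seen:       # skip ancestors already visited
            seen[parent.id] = parent.name
            get_ancestors(parent, seen)
    return seen

# A small diamond-shaped hierarchy: the leaf reaches the root via two branches
root = Term('toy:0', 'root')
a = Term('toy:1', 'branch a', [root])
b = Term('toy:2', 'branch b', [root])
leaf = Term('toy:3', 'leaf', [a, b])

for term_id, name in sorted(get_ancestors(leaf).items()):
    print(term_id, name)
```

Here the root is reported once even though it is reachable through both branches.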

### 3.3

Inspect the tree of your term in [QuickGO](https://www.ebi.ac.uk/QuickGO/) to confirm that you got all the correct terms. 

- 🤔 Did you get a similar tree?

## Wrap-up

This session had a lot of new things going on. If you got stuck on an exercise, just ask for help. And if you found the exercises too simple, maybe *you* can help someone. 😉