# An overview of Gene Ontology (? Team)

#### CAFA 5 Protein Function Prediction Competition
https://www.kaggle.com/competitions/cafa-5-protein-function-prediction/overview

Dataset: 
https://www.kaggle.com/competitions/cafa-5-protein-function-prediction/data

This notebook uses code provided by: 

The Erdős Institute May 2023 Bootcamp
https://github.com/TheErdosInstitute/code-2023


## An overview of Gene Ontology
https://www.kaggle.com/competitions/cafa-5-protein-function-prediction/data

### Gene Ontology consists of three subontologies: Molecular Function (MF), Biological Process (BP), and Cellular Component (CC)

Together they describe a protein by what it does at the molecular level (MF), which biological processes it participates in (BP), and where in the cell it is located (CC).

This dataset uses experimentally determined protein assignments.

#### Training Set
For the training set, we include all proteins with annotated terms that are supported by experimental or high-throughput evidence, a traceable author statement (evidence code TAS), or inference by a curator (IC). More information about evidence codes is available in the Gene Ontology guide to evidence codes. We use annotations from the UniProtKB release of 2022-11-17. The participants are not required to use these data and are also welcome to use any other data available to them.
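The evidence filter described above can be sketched in a few lines. This is only an illustration, not the organizers' pipeline: the grouping of codes follows the standard GO evidence code categories, while the annotation records, the protein IDs, and the `keep_annotation` helper are hypothetical.

```python
# GO evidence codes that match the training-set description above.
EXPERIMENTAL = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"}
HIGH_THROUGHPUT = {"HTP", "HDA", "HMP", "HGI", "HEP"}
OTHER_ACCEPTED = {"TAS", "IC"}  # traceable author statement, inferred by curator
ACCEPTED = EXPERIMENTAL | HIGH_THROUGHPUT | OTHER_ACCEPTED


def keep_annotation(evidence_code):
    """Return True if an annotation with this evidence code would be kept."""
    return evidence_code in ACCEPTED


# Toy records: (protein ID, GO term, evidence code). The protein IDs are made up.
annotations = [
    ("PROT_A", "GO:0003674", "IDA"),
    ("PROT_A", "GO:0008150", "IEA"),  # electronic annotation, dropped
    ("PROT_B", "GO:0005575", "TAS"),
    ("PROT_C", "GO:0008150", "ISS"),  # sequence similarity, dropped
    ("PROT_D", "GO:0003674", "HDA"),
]

training = [record for record in annotations if keep_annotation(record[2])]
print(training)
```

Keeping the accepted codes in named sets makes it easy to tighten the filter, for example to experimental codes only, when using additional data.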

#### Test Superset
The test superset is a set of protein sequences on which the participants are asked to predict GO terms.

#### Test Set
The test set is unknown at the beginning of the competition. It will contain protein sequences (and their functions) from the test superset that gained experimental annotations between the submission deadline and the time of evaluation.

# File Descriptions

### Gene Ontology: 

The ontology data is in the file go-basic.obo. This structure is the 2023-01-01 release of the GO graph. The file is in OBO format, for which many parsing libraries exist; for example, the obonet package is available for Python. The nodes in this graph are indexed by GO term ID. For example, the roots of the three ontologies are:

subontology_roots = {'BPO':'GO:0008150',
                     'CCO':'GO:0005575',
                     'MFO':'GO:0003674'}
                     
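To make the OBO structure concrete, here is a minimal standard-library sketch that parses `[Term]` stanzas and walks `is_a` links up to a subontology root. It is not a replacement for obonet: the `TOY:` terms, `parse_terms`, and `find_root` are hypothetical, and following only the first `is_a` parent assumes that `is_a` links never leave a subontology.

```python
# A tiny OBO fragment. Only the root is a real GO term; the TOY: terms are made up.
TOY_OBO = """\
format-version: 1.2

[Term]
id: GO:0008150
name: biological_process
namespace: biological_process

[Term]
id: TOY:0000001
name: toy child term
namespace: biological_process
is_a: GO:0008150 ! biological_process

[Term]
id: TOY:0000002
name: toy grandchild term
namespace: biological_process
is_a: TOY:0000001 ! toy child term
"""


def parse_terms(text):
    """Parse [Term] stanzas into {term_id: {'name', 'namespace', 'is_a'}}."""
    terms = {}
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if line.startswith("["):
            # A new stanza starts; only [Term] stanzas are tracked.
            current = {"is_a": []} if line == "[Term]" else None
            continue
        if current is None or ":" not in line:
            continue  # header lines, blank lines, or non-Term stanzas
        tag, _, value = line.partition(":")
        value = value.split("!")[0].strip()  # drop the trailing "! comment"
        if tag == "id":
            terms[value] = current
        elif tag == "is_a":
            current["is_a"].append(value)
        elif tag in ("name", "namespace"):
            current[tag] = value
    return terms


def find_root(terms, term_id):
    """Follow the first is_a parent repeatedly until a term has no parent."""
    while terms[term_id]["is_a"]:
        term_id = terms[term_id]["is_a"][0]
    return term_id


terms = parse_terms(TOY_OBO)
print(find_root(terms, "TOY:0000002"))
```

Comparing the returned root against the three root IDs listed above is one way to tell which subontology a term belongs to; the `namespace` tag carries the same information directly.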

In [2]:
# This code uses excerpts from the following obonet tutorial:
# https://github.com/dhimmel/obonet/blob/main/examples/go-obonet.ipynb

# !pip install obonet

import obonet  # https://pypi.org/project/obonet/
import networkx

# Path to the GO graph file described above
ontologyDataFile = "../Data/Train/go-basic.obo"

# Read the Gene Ontology into a networkx graph
graph = obonet.read_obo(ontologyDataFile)

# Number of nodes
print(len(graph))

# Number of edges
print(graph.number_of_edges())

# Check if the ontology is a DAG
print(networkx.is_directed_acyclic_graph(graph))

# Mapping from term ID to name
id_to_name = {id_: data.get('name') for id_, data in graph.nodes(data=True)}
print(id_to_name['GO:0008150'])  # 'GO:0008150' is Biological Process
print(id_to_name['GO:0005575'])  # 'GO:0005575' is Cellular Component
print(id_to_name['GO:0003674'])  # 'GO:0003674' is Molecular Function

# Edges point from a term to its parent, so networkx.descendants returns
# superterms and networkx.ancestors returns subterms. 'GO:0008150' is a root
# with no superterms, so this should print an empty set.
print(networkx.descendants(graph, 'GO:0008150'))

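Note: this cell raised `FileNotFoundError: [Errno 2] No such file or directory: '../Data/Train/go-basic.obo'` when it was run without the data file in place. Download go-basic.obo from the competition data page to that path before running it.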
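The edge direction is easy to get backwards, so here is a small sketch that runs without the data file. It builds a toy graph whose edges point from child to parent, the convention obonet uses, and compares `networkx.descendants` with `networkx.ancestors`. The `TOY:` node names are hypothetical.

```python
import networkx

# Toy ontology: edges point from a term to its parent, keyed by relationship.
toy = networkx.MultiDiGraph()
toy.add_edge("TOY:child", "TOY:root", key="is_a")
toy.add_edge("TOY:grandchild", "TOY:child", key="is_a")

# descendants() follows outgoing edges, i.e. towards the root: superterms.
superterms = networkx.descendants(toy, "TOY:grandchild")

# ancestors() follows incoming edges, i.e. away from the root: subterms.
subterms = networkx.ancestors(toy, "TOY:root")

# A root has no outgoing edges, so it has no superterms.
root_superterms = networkx.descendants(toy, "TOY:root")

print(sorted(superterms))
print(sorted(subterms))
print(root_superterms)
```

On the real GO graph, this means that listing every term under a subontology root calls for `networkx.ancestors`, not `networkx.descendants`.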

## Lookup node properties