# An overview of Gene Ontology (? Team)

#### CAFA 5 Protein Function Prediction Competition
https://www.kaggle.com/competitions/cafa-5-protein-function-prediction/overview

Dataset: 
https://www.kaggle.com/competitions/cafa-5-protein-function-prediction/data

This notebook uses code provided by: 

The Erdös Institute May 2023 Bootcamp
https://github.com/TheErdosInstitute/code-2023


## An overview of Gene Ontology. 
https://www.kaggle.com/competitions/cafa-5-protein-function-prediction/data

### Gene Ontology consists of three subontologies: Molecular Function (MF), Biological Process (BP), and Cellular Component (CC)

Each subontology describes a different aspect of a protein: what it does at the molecular level (MF), which biological processes it participates in (BP), and where in the cell it is located (CC).

This dataset uses experimentally determined protein assignments.

#### Training Set
For the training set, we include all proteins with annotated terms that have been validated by experimental or high-throughput evidence, traceable author statement (evidence code TAS), or inferred by curator (IC). More information about evidence codes can be found here. We use annotations from the UniProtKB release of 2022-11-17. The participants are not required to use these data and are also welcome to use any other data available to them.
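
The evidence-code filter described above can be sketched as follows. This is a minimal illustration, not the organizers' pipeline: the code groups follow the standard GO evidence code categories, while the `keep_annotation` helper, the protein labels, and the toy records are hypothetical.

```python
# Evidence codes accepted for the training set, per the description above.
EXPERIMENTAL = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"}
HIGH_THROUGHPUT = {"HTP", "HDA", "HMP", "HGI", "HEP"}
CURATED = {"TAS", "IC"}  # traceable author statement, inferred by curator
ACCEPTED = EXPERIMENTAL | HIGH_THROUGHPUT | CURATED


def keep_annotation(evidence_code):
    """Return True if the evidence code qualifies for the training set."""
    return evidence_code in ACCEPTED


# Toy (protein, GO term, evidence code) records, made up for illustration.
toy_annotations = [
    ("protein_A", "GO:0006909", "IDA"),
    ("protein_A", "GO:0042552", "IEA"),  # electronic annotation: excluded
    ("protein_B", "GO:0042552", "TAS"),
]
kept = [rec for rec in toy_annotations if keep_annotation(rec[2])]
print(kept)
```

Keeping the accepted codes in named sets makes it easy to tighten or loosen the filter when using annotation data from other sources.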

#### Test Superset
The test superset is a set of protein sequences on which the participants are asked to predict GO terms.

#### Test Set
The test set is unknown at the beginning of the competition. It will contain protein sequences (and their functions) from the test superset that gained experimental annotations between the submission deadline and the time of evaluation.

# File Descriptions

### Gene Ontology: 

The ontology data is in the file go-basic.obo. This structure is the 2023-01-01 release of the GO graph. The file is in OBO format, for which many parsing libraries exist; for example, the obonet package is available for Python. The nodes in this graph are indexed by term ID (such as GO:0008150). For example, the roots of the three ontologies are:

subontology_roots = {'BPO':'GO:0008150',
                     'CCO':'GO:0005575',
                     'MFO':'GO:0003674'}
                     

In [1]:
# This code uses code excerpts from the following obonet tutorial
# https://github.com/dhimmel/obonet/blob/main/examples/go-obonet.ipynb

# !pip install obonet

import obonet #https://pypi.org/project/obonet/
import networkx

ontologyDataFile = "../Data/Train/go-basic.obo"

# Read the Gene Ontology graph (one read is enough; parsing is slow)
graph = obonet.read_obo(ontologyDataFile)

# Number of nodes
print(len(graph))

# Number of edges
print(graph.number_of_edges())

# Check if the ontology is a DAG
print(networkx.is_directed_acyclic_graph(graph))

# Mapping from term ID to name
id_to_name = {id_: data.get('name') for id_, data in graph.nodes(data=True)}
print(id_to_name['GO:0008150'])  # 'GO:0008150' is Biological Process
print(id_to_name['GO:0005575'])  # 'GO:0005575' is Cellular Component
print(id_to_name['GO:0003674'])  # 'GO:0003674' is Molecular Function

# Find all superterms of biological_process. Note that networkx.descendants
# gets superterms, while networkx.ancestors returns subterms. A root term has
# no superterms, so this prints an empty set.
print(networkx.descendants(graph, 'GO:0008150'))

43248
84805
True
biological_process
cellular_component
molecular_function
set()
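
The direction convention noted in the cell above is easy to get backwards, so here is a small sketch on a toy graph. It assumes the same layout obonet produces (a `MultiDiGraph` whose edges run from the more specific term to the more general one, keyed by relation); the node names are placeholders, not real GO IDs.

```python
import networkx as nx

# Toy three-term hierarchy; the node names are placeholders, not GO IDs.
toy = nx.MultiDiGraph()
toy.add_edge("child_term", "parent_term", key="is_a")
toy.add_edge("parent_term", "root_term", key="is_a")

# Edges point from the specific term to the general one, so following
# edges forward (descendants) climbs toward the root.
superterms = nx.descendants(toy, "child_term")
subterms = nx.ancestors(toy, "root_term")
root_superterms = nx.descendants(toy, "root_term")

print(superterms)
print(subterms)
print(root_superterms)  # a root has no superterms
```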


## Lookup node properties

In [2]:
# Retrieve properties of phagocytosis
graph.nodes['GO:0006909']

{'name': 'phagocytosis',
 'namespace': 'biological_process',
 'def': '"A vesicle-mediated transport process that results in the engulfment of external particulate material by phagocytes and their delivery to the lysosome. The particles are initially contained within phagocytic vacuoles (phagosomes), which then fuse with primary lysosomes to effect digestion of the particles." [ISBN:0198506732]',
 'xref': ['Wikipedia:Phagocytosis'],
 'is_a': ['GO:0016192']}
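
Since every node carries a `namespace` property, one simple use is counting terms per subontology. The sketch below shows the idea on a toy list that stands in for `graph.nodes(data=True)`; the `count_by_namespace` helper is hypothetical, and the toy entries reuse terms shown in this notebook.

```python
from collections import Counter


def count_by_namespace(nodes_with_data):
    """Count terms per subontology from (term_id, data) pairs."""
    return Counter(data.get("namespace") for _, data in nodes_with_data)


# Toy stand-in for graph.nodes(data=True), using terms shown in this notebook.
toy_nodes = [
    ("GO:0008150", {"name": "biological_process", "namespace": "biological_process"}),
    ("GO:0006909", {"name": "phagocytosis", "namespace": "biological_process"}),
    ("GO:0005575", {"name": "cellular_component", "namespace": "cellular_component"}),
    ("GO:0003674", {"name": "molecular_function", "namespace": "molecular_function"}),
]
counts = count_by_namespace(toy_nodes)
print(counts)
```

With the real graph loaded, the same helper could be called as `count_by_namespace(graph.nodes(data=True))`.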

## Create name mappings


In [3]:
id_to_name = {id_: data.get('name') for id_, data in graph.nodes(data=True)}
name_to_id = {data['name']: id_ for id_, data in graph.nodes(data=True) if 'name' in data}

# Get the name for GO:0042552
print(id_to_name['GO:0042552'])

# Get the id for myelination
print(name_to_id['myelination'])

myelination
GO:0042552


## Find parent or child relationships

In [4]:
# Find edges to parent terms
node = name_to_id['pilus']
for child, parent, key in graph.out_edges(node, keys=True):
    print(f'• {id_to_name[child]} ⟶ {key} ⟶ {id_to_name[parent]}')

• pilus ⟶ is_a ⟶ cell projection


In [5]:
# Find edges to children terms
node = name_to_id['pilus']
for parent, child, key in graph.in_edges(node, keys=True):
    print(f'• {id_to_name[child]} ⟵ {key} ⟵ {id_to_name[parent]}')

• pilus ⟵ part_of ⟵ pilus shaft
• pilus ⟵ part_of ⟵ pilus tip
• pilus ⟵ is_a ⟵ type IV pilus
• pilus ⟵ is_a ⟵ curli
• pilus ⟵ is_a ⟵ type I pilus
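
Because the edge key holds the relation type, parents can be filtered by relation, for instance to follow only `is_a` edges. The following is a minimal sketch with a hypothetical helper, `parents_by_relation`, on a toy graph that copies a few of the pilus edges printed above and uses term names as node IDs for readability.

```python
import networkx as nx


def parents_by_relation(graph, node, relation):
    """Return the parent terms reached from `node` via edges keyed `relation`."""
    return sorted(
        parent
        for _, parent, key in graph.out_edges(node, keys=True)
        if key == relation
    )


# Toy graph mirroring some pilus edges printed above (names used as node IDs).
toy = nx.MultiDiGraph()
toy.add_edge("pilus", "cell projection", key="is_a")
toy.add_edge("pilus shaft", "pilus", key="part_of")
toy.add_edge("type IV pilus", "pilus", key="is_a")

print(parents_by_relation(toy, "pilus", "is_a"))
print(parents_by_relation(toy, "pilus shaft", "part_of"))
```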


## Find all superterms to myelination

In [6]:
sorted(id_to_name[superterm] for superterm in networkx.descendants(graph, 'GO:0042552'))


['anatomical structure development',
 'axon ensheathment',
 'biological_process',
 'cellular process',
 'developmental process',
 'ensheathment of neurons',
 'multicellular organism development',
 'multicellular organismal process',
 'nervous system development',
 'system development']
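
Superterm lookups like the one above are the basis for propagating annotations: under GO's true path rule, a protein annotated with a term is implicitly annotated with all of its superterms. The sketch below shows one way to compute that closure. The `propagate` helper is hypothetical, and the toy chain uses names from the list above with simplified, assumed edges rather than the full GO structure.

```python
import networkx as nx


def propagate(graph, terms):
    """Return `terms` plus every superterm reachable from them."""
    closed = set(terms)
    for term in terms:
        closed |= nx.descendants(graph, term)
    return closed


# Simplified toy chain; the exact edges are an assumption for illustration.
toy = nx.MultiDiGraph()
toy.add_edge("myelination", "axon ensheathment", key="is_a")
toy.add_edge("axon ensheathment", "ensheathment of neurons", key="is_a")
toy.add_edge("ensheathment of neurons", "cellular process", key="is_a")
toy.add_edge("cellular process", "biological_process", key="is_a")

propagated = propagate(toy, {"myelination"})
print(sorted(propagated))
```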

## Find all subterms to myelination

In [7]:
sorted(id_to_name[subterm] for subterm in networkx.ancestors(graph, 'GO:0042552'))


['central nervous system myelin formation',
 'central nervous system myelin maintenance',
 'central nervous system myelination',
 'myelin assembly',
 'myelin maintenance',
 'myelination in peripheral nervous system',
 'myelination of anterior lateral line nerve axons',
 'myelination of lateral line nerve axons',
 'myelination of posterior lateral line nerve axons',
 'negative regulation of myelination',
 'paranodal junction assembly',
 'peripheral nervous system myelin formation',
 'peripheral nervous system myelin maintenance',
 'positive regulation of myelination',
 'regulation of myelination']

## Find all paths to the root

In [8]:
paths = networkx.all_simple_paths(
    graph,
    source=name_to_id['starch binding'],
    target=name_to_id['molecular_function']
)
for path in paths:
    print('•', ' ⟶ '.join(id_to_name[node] for node in path))

• starch binding ⟶ polysaccharide binding ⟶ carbohydrate binding ⟶ binding ⟶ molecular_function
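
When only the distance to the root matters, rather than every path, `networkx.shortest_path_length` gives a term's depth directly. This sketch rebuilds the single path printed above as a toy graph (term names as node IDs); the `depth_to_root` helper is hypothetical.

```python
import networkx as nx


def depth_to_root(graph, term, root):
    """Number of edges on the shortest path from `term` up to `root`."""
    return nx.shortest_path_length(graph, source=term, target=root)


# Toy graph holding the one path printed above, with names as node IDs.
chain = [
    "starch binding",
    "polysaccharide binding",
    "carbohydrate binding",
    "binding",
    "molecular_function",
]
toy = nx.MultiDiGraph()
for child, parent in zip(chain, chain[1:]):
    toy.add_edge(child, parent, key="is_a")

depth = depth_to_root(toy, "starch binding", "molecular_function")
print(depth)  # 4
```

If the term and root belong to different subontologies there is no path, and networkx raises `NetworkXNoPath`.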


## See the ontology metadata

In [10]:
graph.graph

{'typedefs': [{'id': 'negatively_regulates',
   'name': 'negatively regulates',
   'namespace': 'external',
   'xref': ['RO:0002212'],
   'is_a': ['regulates']},
  {'id': 'part_of',
   'name': 'part of',
   'namespace': 'external',
   'xref': ['BFO:0000050'],
   'is_transitive': 'true'},
  {'id': 'positively_regulates',
   'name': 'positively regulates',
   'namespace': 'external',
   'xref': ['RO:0002213'],
   'holds_over_chain': ['negatively_regulates negatively_regulates'],
   'is_a': ['regulates']},
  {'id': 'regulates',
   'name': 'regulates',
   'namespace': 'external',
   'xref': ['RO:0002211'],
   'is_transitive': 'true'},
  {'id': 'term_tracker_item',
   'name': 'term tracker item',
   'namespace': 'external',
   'xref': ['IAO:0000233'],
   'is_metadata_tag': 'true',
   'is_class_level': 'true'}],
 'instances': [],
 'format-version': '1.2',
 'data-version': 'releases/2023-01-01',
 'subsetdef': ['chebi_ph7_3 "Rhea list of ChEBI terms representing the major species at pH 7.3."',

## Create a dictionary of obsolete terms to their replacements


In [11]:
graph_with_obs = obonet.read_obo(ontologyDataFile, ignore_obsolete=False)
len(graph_with_obs)


47417

In [12]:
old_to_new = dict()
for node, data in graph_with_obs.nodes(data=True):
    for replaced_by in data.get("replaced_by", []):
        old_to_new[node] = replaced_by
list(old_to_new.items())[:5]

[('GO:0000108', 'GO:0000109'),
 ('GO:0000174', 'GO:0000750'),
 ('GO:0000229', 'GO:0005694'),
 ('GO:0000260', 'GO:0046961'),
 ('GO:0000261', 'GO:0046962')]
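
A mapping like `old_to_new` can then be used to modernize a list of term IDs. The sketch below uses a hypothetical helper, `update_terms`, and two of the replacement pairs printed above; it performs a single lookup per term and does not follow chains of successive replacements.

```python
def update_terms(terms, old_to_new):
    """Replace obsolete term IDs with their successors; keep others as-is."""
    return [old_to_new.get(term, term) for term in terms]


# Two of the replacement pairs printed above.
toy_old_to_new = {"GO:0000108": "GO:0000109", "GO:0000174": "GO:0000750"}

updated = update_terms(["GO:0000108", "GO:0042552", "GO:0000174"], toy_old_to_new)
print(updated)
```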