# GO-BP ID Ancestor Functions

This program removes redundant ancestor terms from a set of GO IDs, using the Gene Ontology file 'go.obo' downloaded from [Geneontology.org](http://geneontology.org/).
Consider the terms:

    GO:0006694, GO:0006702, GO:0006704, GO:0007548, GO:0008202, GO:0042446, GO:0042448, GO:0055114

In this case, 3 of the 8 GO terms are redundant because they are ancestors of other terms in the list. The GO terms that remain after removing them are:

    GO:0006702, GO:0042448, GO:0006704, GO:0007548, GO:0055114
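
The idea can be sketched on a toy hierarchy without loading the ontology. This is only an illustration: the `T:` IDs, `toy_parents`, `all_ancestors`, and `drop_redundant` are hypothetical names invented for the example and are not part of the program or of the Gene Ontology.

```python
# Toy hierarchy (hypothetical IDs): each term maps to its direct parents.
toy_parents = {
    "T:root": set(),
    "T:metabolism": {"T:root"},
    "T:steroid": {"T:metabolism"},
    "T:androgen": {"T:steroid"},
    "T:development": {"T:root"},
}


def all_ancestors(term, parents):
    """Collect every ancestor of `term` by walking up the parent links."""
    found = set()
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(parents[current])
    return found


def drop_redundant(terms, parents):
    """Keep only the terms that are not an ancestor of another term in the set."""
    covered = set()
    for term in terms:
        covered |= all_ancestors(term, parents)
    return terms - covered


annotated = {"T:metabolism", "T:steroid", "T:androgen", "T:development"}
print(sorted(drop_redundant(annotated, toy_parents)))
# → ['T:androgen', 'T:development']
```

Here `T:metabolism` and `T:steroid` are dropped because the more specific `T:androgen` is already in the set.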


### Create a directed acyclic graph using the go.obo file

In [1]:
# Use goatools to access Gene Ontology .obo files.
from goatools import obo_parser

# Use os module to access files in other directories.
from os import path

# Create a directed acyclic graph from the 'go.obo' file.
gene_ontology = path.abspath(
    '../Gene-Disease Associations/go.obo')

ontology = obo_parser.GODag(gene_ontology)

D:\Documents\Research\Paper\Camera Ready\Programs\Gene-Disease Associations\go.obo: fmt(1.2) rel(2020-06-01) 47,233 GO Terms


### Create a dictionary of ancestors to increase access speed

In [2]:
# Define a dictionary to store the ancestors of each GO-BP term.
# This is done to increase access speed.
ancestors = {}

# Iterate thru every term in the gene ontology.
for term in ontology:
    
    # Store ancestors of each GO-BP term.
    ancestors[term] = ontology[term].get_all_parents()
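
Why precomputing helps can be shown on a toy parent map: each ancestor set is computed once and then looked up. This is a minimal sketch under assumptions. `toy_parents` and `build_ancestor_cache` are hypothetical names, and the recursion below is not a description of how goatools computes `get_all_parents()`.

```python
# Toy parent map (hypothetical IDs): term -> list of direct parents.
toy_parents = {
    "T:root": [],
    "T:a": ["T:root"],
    "T:b": ["T:root"],
    "T:c": ["T:a", "T:b"],
}


def build_ancestor_cache(parents):
    """Return {term: set of all ancestors}, computing each set only once."""
    cache = {}

    def resolve(term):
        # Reuse the stored result if this term was already resolved.
        if term not in cache:
            collected = set()
            for parent in parents[term]:
                collected.add(parent)
                collected |= resolve(parent)
            cache[term] = collected
        return cache[term]

    for term in parents:
        resolve(term)
    return cache


toy_ancestors = build_ancestor_cache(toy_parents)
print(sorted(toy_ancestors["T:c"]))
# → ['T:a', 'T:b', 'T:root']
```

After the cache is built, every later lookup is a single dictionary access rather than a walk up the graph.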

### Define a function that takes a set of GO IDs and removes the IDs that are ancestors of other IDs in the set

In [3]:
def remove_ancestors(bp_set):
    '''Take a set of BP IDs and remove IDs that are
    redundant because a more specific child is already 
    found in the set. 
    
    Parameters:
    bp_set (set): Set of BP ID strings.
    '''
    # Create an empty set to store the parents of each BP term.
    parents = set()
    
    # Iterate thru every shared BP term.
    for bp in bp_set:
        
        # Add the ancestors of each BP term to the set of parents.
        parents = parents.union(ancestors[bp])
    
    # Remove BP ancestors (parents).
    return bp_set.difference(parents)

### Define a function that finds the BP terms shared by two sets and removes redundant BP ancestors

In [4]:
def get_shared_bps_no_ancestors(set1, set2):
    '''Return the set of BP terms that two diseases share
    after removing redundant BP term ancestors.
    
    Parameters:
    set1 (set): The set of BP terms of disease 1.
    set2 (set): The set of BP terms of disease 2.
    '''
    # Create a set of shared BP terms.
    shared_bps = set1.intersection(set2)
    
    # Remove redundant BP ancestors.
    return remove_ancestors(shared_bps)
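
A small, self-contained sketch of the intersect-then-prune order follows. It assumes a hand-written ancestor lookup in place of the real ontology. `toy_ancestors`, `shared_without_ancestors`, and the `T:` IDs are hypothetical and exist only for this illustration.

```python
# Hypothetical precomputed lookup: term -> all of its ancestors.
toy_ancestors = {
    "T:root": set(),
    "T:a": {"T:root"},
    "T:b": {"T:root"},
    "T:a1": {"T:a", "T:root"},
}


def shared_without_ancestors(set1, set2, ancestor_lookup):
    """Intersect two term sets, then drop ancestors of other shared terms."""
    shared = set1 & set2
    redundant = set()
    for term in shared:
        redundant |= ancestor_lookup[term]
    return shared - redundant


disease_1 = {"T:a", "T:a1", "T:b"}
disease_2 = {"T:a", "T:a1", "T:root"}
print(sorted(shared_without_ancestors(disease_1, disease_2, toy_ancestors)))
# → ['T:a1']
```

Because the intersection is taken first, a term counts as shared only when the exact same ID appears in both sets. A parent in one set and its child in the other do not match.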

### Define a function to count the number of GO IDs left (this assumes that all entries with zero GO IDs have been removed)

In [5]:
def count_elements(string, sep):
    '''Count the number of elements stored in a string
    separated by a specific substring. 
    
    Parameters:
    string (str): List of elements.
    sep (str): String used as a separator in the string.
    '''
    # Return 0 if there are no elements (empty string or None).
    if not string:
        return 0

    # Count the separators and add 1 to account for the last item.
    # Example: 'GO:0001234 | GO:0000123' has 2 items and 1 separator.
    # str.count also handles separators longer than one character.
    return string.count(sep) + 1
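
The counting rule (items = separators + 1, with empty input counted as 0) can be checked with a short sketch. `count_items` is a hypothetical stand-in written for this example. It relies on the built-in `str.count`, which counts non-overlapping occurrences of a substring, so separators longer than one character are handled.

```python
def count_items(text, sep):
    """Count separator-delimited items; empty or missing input counts as 0."""
    if not text:
        return 0
    return text.count(sep) + 1


print(count_items("GO:0001234 | GO:0000123", " | "))  # → 2
print(count_items("GO:0001234", "|"))                  # → 1
print(count_items("", "|"))                            # → 0
print(count_items(None, "|"))                          # → 0
```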

### Define a function that takes an ID and outputs its ancestors organized by level

In [6]:
def get_to_root(term, root, ancestors):
    '''Return a dictionary mapping each level to the set of IDs at that level, leading to the root.
    
    Parameters:
    term: The term having parents, item_id, and level as properties.
    root: The hierarchical root, such as 'DOID:4' or 'GO:0008150'.
    ancestors: Dictionary storing the term ancestors and their levels.
    '''
    # Iterate thru every parent the term has.
    for p in term.parents:

        # Base case (root is reached).
        if p.item_id == root:

            # Add the last level to the dictionary.
            ancestors[0] = {root}

            # Move on to the next parent instead of returning, so the
            # remaining parents of the term are still visited.
            continue

        # Recursive step (store the parent in its level, creating
        # the level if it does not exist).
        ancestors.setdefault(p.level, set()).add(p.item_id)

        # Associate the parents of the parent to their levels.
        get_to_root(p, root, ancestors)

    # Return the dictionary of ancestors. 
    return ancestors

def get_ancestors(term, root):
    '''Return a dictionary of IDs labeled by level, leading to the root.
    
    Parameters:
    term (str): The term for which all the parents (leading up 
    to the root node) will be obtained.
    root (str): The hierarchical root, such as 'DOID:4' or 'GO:0008150'.
    '''
    # Try to use the ontology to define the term.
    try:
        term = ontology[term]
        
    except KeyError:
        # The key does not exist. Assume its level is 1 since
        # only the root can be at level 0.
        return {1: {term}}
    
    # Create an empty dictionary to associate the parents of the ID
    # to their corresponding levels. The root node is at level 0.
    ancestors = {}
    ancestors[term.level] = {term.item_id}     
    
    # Associate the parents of the term to their levels. 
    return get_to_root(term, root, ancestors)
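
The shape of the level-labeled result can be illustrated with toy term objects. This is a sketch under assumptions. `ToyTerm` is a hypothetical stand-in that only mimics the `item_id`, `level`, and `parents` attributes used above, and `ancestors_by_level` walks the graph with an explicit stack rather than recursion.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTerm:
    """Hypothetical stand-in for an ontology term."""
    item_id: str
    level: int
    parents: list = field(default_factory=list)


root = ToyTerm("T:root", 0)
a = ToyTerm("T:a", 1, [root])
b = ToyTerm("T:b", 1, [root])
c = ToyTerm("T:c", 2, [a, b])


def ancestors_by_level(term):
    """Map each level to the set of IDs at that level, from `term` up to the root."""
    levels = {term.level: {term.item_id}}
    stack = list(term.parents)
    while stack:
        parent = stack.pop()
        levels.setdefault(parent.level, set()).add(parent.item_id)
        stack.extend(parent.parents)
    return levels


result = ancestors_by_level(c)
print({level: sorted(ids) for level, ids in sorted(result.items())})
# → {0: ['T:root'], 1: ['T:a', 'T:b'], 2: ['T:c']}
```

The starting term is included at its own level, and both parents of `T:c` land in the same level-1 set.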

### Define a function that converts a dictionary of ancestors by level into a set

In [7]:
def ancestor_dict2set(ancestors):
    '''Return the set of IDs leading to the root.

    Parameters:
    ancestors (dict): The dictionary with a list of IDs labeled by 
    level, leading to the root.
    '''
    # Create an empty set.
    terms = set()
    
    # Iterate thru every level in the ancestors.
    for values in ancestors.values():
        
        # Add the ancestors to the set.
        terms = terms.union(values)
        
    # Return the set.
    return terms
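
For reference, the same flattening can be written in one expression, since `set.union` accepts any number of iterables. The `levels` dictionary below is hypothetical example data.

```python
# Hypothetical level-labeled ancestors: level -> set of IDs.
levels = {0: {"T:root"}, 1: {"T:a", "T:b"}, 2: {"T:c"}}

# Merge every level's set into one set; an empty dictionary gives an empty set.
flat = set().union(*levels.values())
print(sorted(flat))
# → ['T:a', 'T:b', 'T:c', 'T:root']
```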

### Define a function that returns the set of a term's ancestors without their levels

In [8]:
def get_ancestor_set(term, root):
    '''Return the set of IDs leading to the root.

    Parameters:
    term (str): The term for which all the parents (leading up 
    to the root node) will be obtained.
    root (str): The hierarchical root, such as 'DOID:4' or 'GO:0008150'.
    '''
    # Get the ancestors, labeled by level.
    labeled_ancestors = get_ancestors(term, root)
    
    # Return the unlabeled ancestors stored in a set.
    return ancestor_dict2set(labeled_ancestors)

### Define a function that takes a string of GO IDs and returns a set containing the GO IDs and their ancestors

In [9]:
def get_all_ancestors(terms, sep=None, join_sep=None):
    '''Take a string containing GO IDs and return a set containing
    the GO IDs and their ancestors (or a joined string if join_sep is given).
    
    Parameters:
    terms (str): The string containing the GO IDs. The IDs
    may be separated by a certain substring.
    sep (str): The substring used to separate the elements
    in the string.
    join_sep (str): The string used to separate the list
    of GO IDs before it is returned to the user.
    '''
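    # Split the terms using the given separator.
    if sep:
        terms = terms.split(sep)
    elif isinstance(terms, str):
        # Without a separator, treat a plain string as a single ID
        # so that it is not iterated character by character.
        terms = [terms]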
       
    # Create an empty set of ancestors.
    ancestors = set()
    
    # Iterate thru every term in the IDs.
    for term in terms:
        
        # Get all the ancestors of the term leading to the root.
        term_ancestors = get_ancestor_set(term, 'GO:0008150')
        
        # Add the ancestors to the set.
        ancestors = ancestors.union(term_ancestors)
        
    # Return the ancestor set as a string 
    # separated by the given separator.
    if join_sep:
        return join_sep.join(ancestors)
    
    # Return the ancestor set.
    return ancestors
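
The split, expand, and join steps can be sketched with a toy lookup instead of the ontology. `toy_ancestors` and `expand_terms` are hypothetical names for this illustration. Sorting before joining is an added assumption here, used so that the output string is reproducible, since sets have no fixed order.

```python
# Hypothetical lookup: term -> all of its ancestors.
toy_ancestors = {
    "T:a1": {"T:a", "T:root"},
    "T:b1": {"T:b", "T:root"},
}


def expand_terms(text, sep, join_sep, ancestor_lookup):
    """Split `text` into IDs, add each ID's ancestors, and join the result."""
    expanded = set()
    for term in text.split(sep):
        term = term.strip()
        expanded.add(term)
        # Unknown IDs contribute only themselves.
        expanded |= ancestor_lookup.get(term, set())
    # Sort so the joined string is the same on every run.
    return join_sep.join(sorted(expanded))


print(expand_terms("T:a1|T:b1|T:unknown", "|", "|", toy_ancestors))
# → T:a|T:a1|T:b|T:b1|T:root|T:unknown
```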

### List defined functions
- Create a directed acyclic graph using the go.obo file.
- Define remove_ancestors: Take a set of BP IDs and remove redundant ID ancestors.
- Define get_shared_bps_no_ancestors: Return the set of BP terms that two diseases share after removing redundant BP term ancestors.
- Define count_elements: Count the number of GO IDs left (assumes that entries with zero elements are empty or null).
- Helper function get_to_root: Return a dictionary of IDs labeled by level, leading to the root.
- Helper function get_ancestors: Return a dictionary of IDs labeled by level, leading to the root.
- Helper function ancestor_dict2set: Convert a dictionary of ancestors into a set.
- Helper function get_ancestor_set: Return the set of a term's ancestors without their levels.
- Define get_all_ancestors: Take a string containing GO IDs and return a set containing the GO IDs and their ancestors.

In [10]:
print('''Create a directed acyclic graph using the go.obo file.
Define remove_ancestors: Take a set of BP IDs and remove redundant ID ancestors.
Define get_shared_bps_no_ancestors: Return the set of BP terms that two diseases share after removing redundant BP term ancestors.
Define count_elements: Count the number of GO IDs left (assumes that entries with zero elements are empty or null).
Define get_all_ancestors: Take a string of GO IDs and return a set containing the GO IDs and their ancestors.''')

Create a directed acyclic graph using the go.obo file.
Define remove_ancestors: Take a set of BP IDs and remove redundant ID ancestors.
Define get_shared_bps_no_ancestors: Return the set of BP terms that two diseases share after removing redundant BP term ancestors.
Define count_elements: Count the number of GO IDs left (assumes that entries with zero elements are empty or null).
Define get_all_ancestors: Take a string of GO IDs and return a set containing the GO IDs and their ancestors.
