# Gene-Disease Associations File

This program creates a gene-disease associations file automatically using the files downloaded from [OMIM.org](https://www.omim.org/) and [GeneOntology.org](http://geneontology.org/).

This program requires the following files:

    morbidmap.txt
    mim2gene.txt
    goa_human.gaf
    go.obo
    GO-BP ID Ancestor Functions

The program outputs the following files:

    Gene-Disease Associations, All GO IDs.csv
    Gene-Disease Associations, No GO ID Ancestors.csv

The program can take a couple of minutes to produce these files and may seem unresponsive.

## Define the filenames

In [1]:
# Import pandas module to open files.
import pandas 

# Default file names.
morbid_map = "morbidmap.txt" 
mim_2_gene = "mim2gene.txt"
human_go_a = "goa_human.gaf"
g_ontology = "go.obo"

## Access the files and treat them as .csv files

The Gene Ontology file "go.obo" does not follow a tabular format, so the pandas module cannot deal with it. The Gene Ontology file will be processed later. 

In [2]:
'''Parameters used for opening files:

    filename (str): The filename to work with.
    sep (str): The separator, such as commas or tabs.
    comment (char): The character that marks lines as comments.
    usecols (list): The columns to use.
    index_col (list): The column to use as index.
    skiprows (int): The number of lines to skip.
    dtype: The data type.
    fillna (str): The string to use to fills empty fields.
    names (str): The column names. 
'''

morbid_map_file = pandas.read_csv(
    morbid_map, 
    sep='\t', 
    comment='#', 
    usecols=['Phenotype','MIM Number','Gene Symbol'], 
    names=('Phenotype',  
           'Gene Symbol', 
           'MIM Number', 
           'Cyto Location'))
    
mim_2_gene_file = pandas.read_csv(
    mim_2_gene, 
    sep='\t', 
    comment='#', 
    index_col = ['MIM Number'],
    usecols=['MIM Number', 'Entrez Gene ID (NCBI)',
             'Approved Gene Symbol (HGNC)'], 
    dtype = {'Entrez Gene ID (NCBI)' : 'str'},
    names=('MIM Number', 
           'MIM Entry Type', 
           'Entrez Gene ID (NCBI)', 
           'Approved Gene Symbol (HGNC)',
           'Ensembl Gene ID (Ensembl)'))

# skiprows=31 and comment='!' cause an error. 
human_go_a_file = pandas.read_csv(
    human_go_a, 
    sep='\t',  
    skiprows=30,
    usecols=['Gene Symbol','Qualifier','GO ID','Aspect'],
    names=('DB', 
           'DB Symbol', 
           'Gene Symbol', 
           'Qualifier',
           'GO ID',
           'Reference',
           'Evidence Code',
           'With',
           'Aspect',
           'DB Name',
           'Synonym',
           'DB Type',
           'Taxon ID',
           'Date',
           'Assigned By?')).fillna('')  
            #Fill empty spaces with '' instead of NaN
            #This is needed to search empty fields 
            #Example: if row['Qualifier'] != '':

## Filter the human GO annotations file so that only biological processes are left

In [3]:
# Filter human GO annotations file by biological process.
human_go_a_file = human_go_a_file.loc[human_go_a_file['Aspect'] == 'P'] 

# Drop the 'Aspect' column since only biological processes are left.
human_go_a_file = human_go_a_file.drop(columns =['Aspect'])

# Drop rows that have the same gene symbol, qualifier, and GO ID.
human_go_a_file = human_go_a_file.drop_duplicates()

# Reset the index so it starts at zero, and drop the previous index.
human_go_a_file = human_go_a_file.reset_index(drop = True)

# Set gene symbol column as index to speed up search by gene symbol.
human_go_a_file = human_go_a_file.set_index(['Gene Symbol'])

#### Display the contents from the human GO annotations files

In [4]:
# For visualization only: may delete code line.
# It is important that the qualifier column 
# show '' instead of NaN for empty fields.
human_go_a_file

Unnamed: 0_level_0,Qualifier,GO ID
Gene Symbol,Unnamed: 1_level_1,Unnamed: 2_level_1
IGKV3-7,,GO:0002250
IGKV1D-42,,GO:0002250
IGLV4-69,,GO:0002250
IGLV8-61,,GO:0002250
IGLV4-60,,GO:0002250
...,...,...
IGLV10-54,,GO:0002377
MEIS1,,GO:0008284
SLC9C2,,GO:0051453
LEFTY2,,GO:0030509


#### Display the contents from the morbid map file

In [5]:
# For visualization only: may delete code line.
morbid_map_file

Unnamed: 0,Phenotype,Gene Symbol,MIM Number
0,"17,20-lyase deficiency, isolated, 202110 (3)","CYP17A1, CYP17, P450C17",609300
1,"17-alpha-hydroxylase/17,20-lyase deficiency, 2...","CYP17A1, CYP17, P450C17",609300
2,"2,4-dienoyl-CoA reductase deficiency, 616034 (3)","NADK2, C5orf33, DECRD",615787
3,"2-aminoadipic 2-oxoadipic aciduria, 204750 (3)","DHTKD1, KIAA1630, AMOXAD, CMT2Q",614984
4,"2-methylbutyrylglycinuria, 610006 (3)","ACADSB, SBCAD",600301
...,...,...,...
7940,"{West nile virus, susceptibility to}, 610379 (3)","CCR5, CMKBR5, CCCKR5, IDDM22",601373
7941,"{Wilms tumor 6, susceptibility to}, 616806 (3)","REST, NRSF, WT6, GINGF5, HGF5, DFNA27",600571
7942,"{Wilms tumor susceptibility-5}, 601583 (3)","POU6F2, WTSL, WT5",609062
7943,"{Yao syndrome}, 617321 (3)","NOD2, CARD15, IBD1, CD, YAOS, BLAUS",605956


#### Display the contents from the MIM to gene file

In [6]:
# For visualization only: may delete code line.
mim_2_gene_file

Unnamed: 0_level_0,Entrez Gene ID (NCBI),Approved Gene Symbol (HGNC)
MIM Number,Unnamed: 1_level_1,Unnamed: 2_level_1
100050,,
100070,100329167,
100100,,
100200,,
100300,,
...,...,...
618933,344558,SH3RF3
618934,401145,CCSER1
618936,128497,SPATA25
618937,221908,PPP1R35


## Create a gene-disease associations table and populate it with data from morbid map file

Using the morbid map file:
- Obtain the disease names and MIM numbers for each disease
    1. Iterate through every line in morbid map
    1. Add the disease names and MIM numbers to corresponding lists
    1. Assign each of these lists to a new column
- Obtain the MIM numbers for each disease
- Label each disease as coming from the OMIM database


In [7]:
# Create new dataframe to store info.
gene_disease = pandas.DataFrame(columns=['DB',
                                         'DB ID',
                                         'Map Key',
                                         'Disease',
                                         'MIM Number',
                                         'Gene Symbol',
                                         'Gene ID',
                                         'GO-BP ID Count',
                                         'GO-BP ID',
                                         'GO Definition Count', 
                                         'GO Definition'])

# Define lists to store the disease names, mapping keys, genes, and IDs.
disease_list = []
key_list = []
db_id_list = [] 


# The for-loop accesses every line in the morbidmap file.
for index, row in morbid_map_file.iterrows():
    
    # Example phenotype:
    # 17,20-lyase deficiency, isolated, 202110 (3)
    phenotype = row['Phenotype']
    
    # Get the disease's phenotype mapping key:
    # The phenotype mapping key is found in the last 3 characters.
    # 17,20-lyase deficiency, isolated, 202110 (3)
    # has phenotype_mapping_key = '(3)'
    phenotype_mapping_key = phenotype[-3:]
    
    # Add the phenotype mapping key to the list:
    # Every disease has a mapping key; no need for exception handling.
    # This will help to group diseases based on phenotype mapping key.
    key_list += [phenotype_mapping_key]
        
    try:
        
        # Obtain ID from phenotype:
        # 17,20-lyase deficiency, isolated, 202110 (3)
        # has id = 202110
        db_id = phenotype[ len(phenotype)-10 : len(phenotype)-4 ]
        db_id_list += [int(db_id)]
        
    except ValueError:
        
        # A ValueError occurs whenever the substring cannot be
        # converted into an integer. 
        # This means the database ID is empty.
        db_id_list += ['']

        # Remove the phenotype mapping key:
        # before: 17,20-lyase deficiency, isolated (3)
        # after:  17,20-lyase deficiency, isolated
        disease_list += [phenotype[:len(phenotype)-4]]  
        
    else:
        
        # Remove the database ID and phenotype mapping key
        # if no ValueError occurred:
        # before: 17,20-lyase deficiency, isolated, 202110 (3)
        # after:  17,20-lyase deficiency, isolated
        disease_list += [phenotype[:len(phenotype)-12]]

# Assign the list of diseases to the 'Disease' column.
# This provides all the disease names in the table.
gene_disease['Disease'] = disease_list

# Assign the list of database IDs to the 'DB ID' column. 
# This provides all the database ID numbers for all the diseases.
gene_disease['DB ID'] = db_id_list

# Assign the list of phenotype mapping keys to the 'Map Key' column.
# This provides all the phenotype mapping keys for all the diseases.
gene_disease['Map Key'] = key_list

# Fill every row in the 'DB' (database) column with 'OMIM'.
# This labels every disease as coming from the OMIM database.
gene_disease['DB'] = 'OMIM'

# Copy the 'MIM Number' column into the 'MIM Number' column.
# The MIM numbers will be used to find gene IDs.
gene_disease['MIM Number'] = morbid_map_file['MIM Number'] 

#### Display the gene-disease associations file, and notice that the MIM numbers will serve to obtain the actual gene IDs

In [8]:
#For visualization only: may delete code line
gene_disease

Unnamed: 0,DB,DB ID,Map Key,Disease,MIM Number,Gene Symbol,Gene ID,GO-BP ID Count,GO-BP ID,GO Definition Count,GO Definition
0,OMIM,202110,(3),"17,20-lyase deficiency, isolated",609300,,,,,,
1,OMIM,202110,(3),"17-alpha-hydroxylase/17,20-lyase deficiency",609300,,,,,,
2,OMIM,616034,(3),"2,4-dienoyl-CoA reductase deficiency",615787,,,,,,
3,OMIM,204750,(3),2-aminoadipic 2-oxoadipic aciduria,614984,,,,,,
4,OMIM,610006,(3),2-methylbutyrylglycinuria,600301,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...
7940,OMIM,610379,(3),"{West nile virus, susceptibility to}",601373,,,,,,
7941,OMIM,616806,(3),"{Wilms tumor 6, susceptibility to}",600571,,,,,,
7942,OMIM,601583,(3),{Wilms tumor susceptibility-5},609062,,,,,,
7943,OMIM,617321,(3),{Yao syndrome},605956,,,,,,


## Remove the entries that do not have the phenotype mapping key '(3)'

In [9]:
# Get the diseases that have a mapping key of '(3)'.
gene_disease = gene_disease[gene_disease['Map Key'] == '(3)']

# Reset the index numbering and drop the previous index.
gene_disease = gene_disease.reset_index(drop = True)

# Remove the 'Map Key' column since it is no longer needed.
gene_disease = gene_disease.drop(['Map Key'], axis = 1)

#### Display the result of removing entries that do not have the phenotype mapping key '(3)', and then removing the 'Map Key' column

In [10]:
# For visualization only: may delete code line.
gene_disease

Unnamed: 0,DB,DB ID,Disease,MIM Number,Gene Symbol,Gene ID,GO-BP ID Count,GO-BP ID,GO Definition Count,GO Definition
0,OMIM,202110,"17,20-lyase deficiency, isolated",609300,,,,,,
1,OMIM,202110,"17-alpha-hydroxylase/17,20-lyase deficiency",609300,,,,,,
2,OMIM,616034,"2,4-dienoyl-CoA reductase deficiency",615787,,,,,,
3,OMIM,204750,2-aminoadipic 2-oxoadipic aciduria,614984,,,,,,
4,OMIM,610006,2-methylbutyrylglycinuria,600301,,,,,,
...,...,...,...,...,...,...,...,...,...,...
6680,OMIM,610379,"{West nile virus, susceptibility to}",601373,,,,,,
6681,OMIM,616806,"{Wilms tumor 6, susceptibility to}",600571,,,,,,
6682,OMIM,601583,{Wilms tumor susceptibility-5},609062,,,,,,
6683,OMIM,617321,{Yao syndrome},605956,,,,,,


## Remove entries that do not have a DB ID

In [11]:
# Get the diseases that do not have an empty DB ID.
gene_disease = gene_disease[gene_disease['DB ID'] != '']

# Reset the index numbering and drop the previous index.
gene_disease = gene_disease.reset_index(drop = True)

#### Display the result of removing entries that do not have a DB ID

In [12]:
# For visualization only: may delete code line.
gene_disease

Unnamed: 0,DB,DB ID,Disease,MIM Number,Gene Symbol,Gene ID,GO-BP ID Count,GO-BP ID,GO Definition Count,GO Definition
0,OMIM,202110,"17,20-lyase deficiency, isolated",609300,,,,,,
1,OMIM,202110,"17-alpha-hydroxylase/17,20-lyase deficiency",609300,,,,,,
2,OMIM,616034,"2,4-dienoyl-CoA reductase deficiency",615787,,,,,,
3,OMIM,204750,2-aminoadipic 2-oxoadipic aciduria,614984,,,,,,
4,OMIM,610006,2-methylbutyrylglycinuria,600301,,,,,,
...,...,...,...,...,...,...,...,...,...,...
6550,OMIM,610379,"{West nile virus, susceptibility to}",601373,,,,,,
6551,OMIM,616806,"{Wilms tumor 6, susceptibility to}",600571,,,,,,
6552,OMIM,601583,{Wilms tumor susceptibility-5},609062,,,,,,
6553,OMIM,617321,{Yao syndrome},605956,,,,,,


## Use MIM numbers to obtain corresponding gene symbols and gene IDs

In [13]:
# Use MIM Number sorted in disease file to get gene symbols and IDs.
# Iterate thru every line in the gene_disease file.
for index, row in gene_disease.iterrows():
    
    # Get MIM number from row in the gene_disease file.
    mim_num = row['MIM Number']
    
    # Get matching gene symbol from row in the mim2gene file. 
    gene = mim_2_gene_file.at[mim_num, 'Approved Gene Symbol (HGNC)']
    
    # Get matching gene id from row in the mim2gene file. 
    gene_id = mim_2_gene_file.at[mim_num, 'Entrez Gene ID (NCBI)']
    
    # Store gene symbol in the gene_disease file. 
    gene_disease.at[index, 'Gene Symbol'] = gene
    
    # Store gene id in the gene_disease file. 
    gene_disease.at[index, 'Gene ID'] = gene_id

#### Display the result of using MIM numbers to obtain corresponding gene symbols and gene IDs 

The 'GO-BP ID' column contains gene symbols that will later be used to obtain the actual GO-BP ID values

In [14]:
# For visualization only: may delete code line.
gene_disease

Unnamed: 0,DB,DB ID,Disease,MIM Number,Gene Symbol,Gene ID,GO-BP ID Count,GO-BP ID,GO Definition Count,GO Definition
0,OMIM,202110,"17,20-lyase deficiency, isolated",609300,CYP17A1,1586,,,,
1,OMIM,202110,"17-alpha-hydroxylase/17,20-lyase deficiency",609300,CYP17A1,1586,,,,
2,OMIM,616034,"2,4-dienoyl-CoA reductase deficiency",615787,NADK2,133686,,,,
3,OMIM,204750,2-aminoadipic 2-oxoadipic aciduria,614984,DHTKD1,55526,,,,
4,OMIM,610006,2-methylbutyrylglycinuria,600301,ACADSB,36,,,,
...,...,...,...,...,...,...,...,...,...,...
6550,OMIM,610379,"{West nile virus, susceptibility to}",601373,CCR5,1234,,,,
6551,OMIM,616806,"{Wilms tumor 6, susceptibility to}",600571,REST,5978,,,,
6552,OMIM,601583,{Wilms tumor susceptibility-5},609062,POU6F2,11281,,,,
6553,OMIM,617321,{Yao syndrome},605956,NOD2,64127,,,,


## Use the gene symbols to obtain GO-BP ID values and update the 'GO-BP ID Count' column

This process can take a minute to finish.

In [15]:
# Get gene symbol stored in gene_disease row and use that 
# gene symbol to find corresponding GO-BP qualifiers and IDs.

# Iterate thru every row in the gene_disease file.
for index1, row1 in gene_disease.iterrows():
    
    # Get gene symbol from row in gene_disease file.
    gene_symbol = row1['Gene Symbol']
    
    try:
        
        # Get GO-BP IDs from human_go_a file that match the gene symbol.
        go_ids = human_go_a_file.at[gene_symbol, 'GO ID']
        
        # Get qualifiers from human_go_a file that match the gene symbol.
        qualifiers = human_go_a_file.at[gene_symbol, 'Qualifier']

    except KeyError:
        
        # Key error means that the gene symbol does not exist:
        # human_go_a file has no GO-BP IDs or qualifiers for this gene.
        go_ids = []
        qualifiers = []
    
    # Create an empty list of GO-BP IDs with qualifiers.
    go_list = []

    # Concatenate each GO ID and its corresponding qualifier.
    for qualifier, go_id in zip(qualifiers, go_ids):    
        
        # Add a space to the qualifier if a qualifier exists.
        qualifier = qualifier + ' ' if qualifier else ''
        
        # Concatenate the qualifier and the GO ID.
        go_list += [qualifier + go_id]

    # Count the number of GO-BP IDs found for the gene symbol.
    go_count = len(go_list)
    
    #Join the list of GO IDs using the string ' | '.
    go_list = ' | '.join(go_list)
    
    # Store the number of GO-BP IDs in the 'GO-BP ID Count' column.
    # Specify the row number in the gene_disease table using index1.
    gene_disease.at[index1, 'GO-BP ID Count'] = go_count
    
    # Store the GO-BP IDs in the 'GO-BP ID Count' column.
    gene_disease.at[index1, 'GO-BP ID'] = go_list

#### Display the result of using the gene symbols to obtain GO-BP IDs and update the 'GO-BP ID Count' column

GO-BP ID values with NOT qualifiers will be shown as 'NOT GO:0001234'.

In [16]:
# For visualization only: may delete code line.
gene_disease

Unnamed: 0,DB,DB ID,Disease,MIM Number,Gene Symbol,Gene ID,GO-BP ID Count,GO-BP ID,GO Definition Count,GO Definition
0,OMIM,202110,"17,20-lyase deficiency, isolated",609300,CYP17A1,1586,8,GO:0006694 | GO:0006702 | GO:0006704 | GO:0007...,,
1,OMIM,202110,"17-alpha-hydroxylase/17,20-lyase deficiency",609300,CYP17A1,1586,8,GO:0006694 | GO:0006702 | GO:0006704 | GO:0007...,,
2,OMIM,616034,"2,4-dienoyl-CoA reductase deficiency",615787,NADK2,133686,3,GO:0006741 | GO:0016310 | GO:0019674,,
3,OMIM,204750,2-aminoadipic 2-oxoadipic aciduria,614984,DHTKD1,55526,3,GO:0002244 | GO:0006091 | GO:0006096,,
4,OMIM,610006,2-methylbutyrylglycinuria,600301,ACADSB,36,3,GO:0006631 | GO:0009083 | GO:0055114,,
...,...,...,...,...,...,...,...,...,...,...
6550,OMIM,610379,"{West nile virus, susceptibility to}",601373,CCR5,1234,21,GO:0000165 | GO:0002407 | GO:0006816 | GO:0006...,,
6551,OMIM,616806,"{Wilms tumor 6, susceptibility to}",600571,REST,5978,37,GO:0000122 | GO:0000381 | GO:0001666 | GO:0002...,,
6552,OMIM,601583,{Wilms tumor susceptibility-5},609062,POU6F2,11281,7,GO:0006355 | GO:0006366 | GO:0007402 | GO:0007...,,
6553,OMIM,617321,{Yao syndrome},605956,NOD2,64127,53,GO:0000187 | GO:0002221 | GO:0002367 | GO:0002...,,


## Remove entries where the GO-BP ID count is zero

In [17]:
# Create a list of rows that only includes rows with GO definitions.
entries_width_definitions = gene_disease['GO-BP ID Count'] != 0

# Update gene_disease table to only contain rows with GO definitions.
gene_disease = gene_disease.loc[entries_width_definitions]

# Reset the index so it starts at zero, and drop the previous index.
gene_disease = gene_disease.reset_index(drop = True)

#### Display the result of removing entries with zero GO-BP IDs  

In [18]:
# For visualization only: may delete code line.
gene_disease

Unnamed: 0,DB,DB ID,Disease,MIM Number,Gene Symbol,Gene ID,GO-BP ID Count,GO-BP ID,GO Definition Count,GO Definition
0,OMIM,202110,"17,20-lyase deficiency, isolated",609300,CYP17A1,1586,8,GO:0006694 | GO:0006702 | GO:0006704 | GO:0007...,,
1,OMIM,202110,"17-alpha-hydroxylase/17,20-lyase deficiency",609300,CYP17A1,1586,8,GO:0006694 | GO:0006702 | GO:0006704 | GO:0007...,,
2,OMIM,616034,"2,4-dienoyl-CoA reductase deficiency",615787,NADK2,133686,3,GO:0006741 | GO:0016310 | GO:0019674,,
3,OMIM,204750,2-aminoadipic 2-oxoadipic aciduria,614984,DHTKD1,55526,3,GO:0002244 | GO:0006091 | GO:0006096,,
4,OMIM,610006,2-methylbutyrylglycinuria,600301,ACADSB,36,3,GO:0006631 | GO:0009083 | GO:0055114,,
...,...,...,...,...,...,...,...,...,...,...
6284,OMIM,610379,"{West nile virus, susceptibility to}",601373,CCR5,1234,21,GO:0000165 | GO:0002407 | GO:0006816 | GO:0006...,,
6285,OMIM,616806,"{Wilms tumor 6, susceptibility to}",600571,REST,5978,37,GO:0000122 | GO:0000381 | GO:0001666 | GO:0002...,,
6286,OMIM,601583,{Wilms tumor susceptibility-5},609062,POU6F2,11281,7,GO:0006355 | GO:0006366 | GO:0007402 | GO:0007...,,
6287,OMIM,617321,{Yao syndrome},605956,NOD2,64127,53,GO:0000187 | GO:0002221 | GO:0002367 | GO:0002...,,


## Define a function that removes BP terms with the 'NOT' qualifier and their children

In [19]:
def remove_not_qualifiers(terms):
    '''Remove GO-BP terms that contain the 'NOT' qualifier. 
    Biological processes with the 'NOT' qualifier are not carried
    out by certain gene products, thus the terms and all of their
    children must be removed. 
    
    Different gene products associated to the same disease may
    perform these biological processes, but should not be removed
    because they belong to another gene.
    
    Parameters:
    terms (str): The pipe-separated list of GO-BP terms to filter.
    '''
    # Convert the string into a set of BP terms.
    terms = set(terms.split(' | '))
    
    # Create an empty set to store the BP terms with the 'NOT'
    # qualifier and their children.
    not_bps = set()
    
    # Iterate thru every BP term.
    for term in terms:
        
        # Select the BP terms with the 'NOT' qualifier.
        if 'NOT' in term:
            
            # Add the term to the set of terms with 'NOT' qualifier.
            not_bps = not_bps.union({term})
            
            # Use gene ontology to define the term.
            term = go[term[4:]]
            
            # Add term children to set of terms with 'NOT' qualifier.
            not_bps = not_bps.union(term.get_all_children())
    
    # Remove all the terms with 'NOT' qualifiers and their children.
    terms = terms.difference(not_bps)
    
    # Return the remaining BP terms as a pipe-separated string.
    return (' | ').join(terms)

## Import GO-BP ID Ancestor Functions

In [20]:
# Import functionality from python notebook.
# Notebook automatically uses OBO parser to define gene ontology.
%run "GO-BP ID Ancestor Functions.ipynb"

go.obo: fmt(1.2) rel(2020-06-01) 47,233 GO Terms
Create an acyclic-directed graph using the GO.obo file.
Define remove_ancestors: Take a list of BP IDs and remove redundant ID ancestors. 
Define get_shared_bps_no_ancestors: Return the set of BP terms that two diseases share after removing redundant BP term ancestors.
Define count_elements: Count the number of GO IDs left (assumes that entries with zero elements are empty or null).
Define get_all_ancestors: Take a string of GO IDs and return a list containing the GO IDs and their parents.
Define count_elements: Count the number of GO IDs left (assumes that entries with zero elements are empty or null).


### Remove BP terms with the 'NOT' qualifier

In [21]:
# Create directed acyclilc-graph of the 'go.obo' file.
go = obo_parser.GODag(g_ontology)

# Remove BP terms with 'NOT' qualifier. The 
# remove_not_qualifiers depends on the gene ontology file
# loaded by "GO-BP ID Ancestor Functions.ipynb".
gene_disease['GO-BP ID'] = gene_disease['GO-BP ID'].apply(
    remove_not_qualifiers)

# Update the GO-BP ID count. The count_elements function
# comes from "Remove GO-BP ID Ancestors.ipynb". 
gene_disease['GO-BP ID Count'] = gene_disease['GO-BP ID'].apply(
    count_elements, sep = '|')

go.obo: fmt(1.2) rel(2020-06-01) 47,233 GO Terms


## Find all the diseases that have DB ID duplicates and then create a list with the first instance of every duplicate

In [22]:
# Get list of entries in gene_disease file that have the same DB ID.
duplicates = gene_disease.duplicated(['DB ID'], keep = False)

# Use duplicates' list to get duplicates's table from gene_disease.
duplicates = gene_disease[duplicates]

# Leave the first instance of the duplicate DB IDs, remove the rest.
no_duplicates = duplicates.drop_duplicates('DB ID')

# Use the DB ID as index in order to speed up search.
duplicates = duplicates.set_index(['DB ID'])

### Display the result of finding all the diseases that have DB ID duplicates

In [23]:
# For visualization only: may delete code line.
duplicates

Unnamed: 0_level_0,DB,Disease,MIM Number,Gene Symbol,Gene ID,GO-BP ID Count,GO-BP ID,GO Definition Count,GO Definition
DB ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
202110,OMIM,"17,20-lyase deficiency, isolated",609300,CYP17A1,1586,8,GO:0006702 | GO:0042446 | GO:0006694 | GO:0008...,,
202110,OMIM,"17-alpha-hydroxylase/17,20-lyase deficiency",609300,CYP17A1,1586,8,GO:0006702 | GO:0042446 | GO:0006694 | GO:0008...,,
614279,OMIM,46XY sex reversal 8,600450,AKR1C2,1646,14,GO:0071395 | GO:0008202 | GO:0051897 | GO:0008...,,
274270,OMIM,5-fluorouracil toxicity,612779,DPYD,1806,8,GO:0006214 | GO:0006212 | GO:0006145 | GO:0006...,,
105200,OMIM,"?Amyloidosis, familial visceral",109700,B2M,567,40,GO:0019885 | GO:0002480 | GO:0051289 | GO:0071...,,
...,...,...,...,...,...,...,...,...,...
612076,OMIM,"{Uric acid concentration, serum, QTL 2}",606142,SLC2A9,56606,5,GO:0015747 | GO:1904659 | GO:0046415 | GO:0008...,,
188050,OMIM,"{Venous thromboembolism, susceptibility to}",603924,HABP2,3026,2,GO:0007155 | GO:0006508,,
188050,OMIM,"{Venous thrombosis, protection against}",134570,F13A1,2162,5,GO:0019221 | GO:0072378 | GO:0002576 | GO:0018...,,
122700,OMIM,{Warfarin sensitivity},300746,F9,2158,5,GO:0007597 | GO:0031638 | GO:0006508 | GO:0006...,,


#### Display the result of creating a list with the first instance of every duplicate

In [24]:
# For visualization only: may delete code line.
no_duplicates

Unnamed: 0,DB,DB ID,Disease,MIM Number,Gene Symbol,Gene ID,GO-BP ID Count,GO-BP ID,GO Definition Count,GO Definition
0,OMIM,202110,"17,20-lyase deficiency, isolated",609300,CYP17A1,1586,8,GO:0006702 | GO:0042446 | GO:0006694 | GO:0008...,,
30,OMIM,614279,46XY sex reversal 8,600450,AKR1C2,1646,14,GO:0071395 | GO:0008202 | GO:0051897 | GO:0008...,,
32,OMIM,274270,5-fluorouracil toxicity,612779,DPYD,1806,8,GO:0006214 | GO:0006212 | GO:0006145 | GO:0006...,,
48,OMIM,105200,"?Amyloidosis, familial visceral",109700,B2M,567,40,GO:0019885 | GO:0002480 | GO:0051289 | GO:0071...,,
68,OMIM,615991,?Bardet-Biedl syndrome 14,610142,CEP290,80184,13,GO:0043312 | GO:0090316 | GO:0000086 | GO:0070...,,
...,...,...,...,...,...,...,...,...,...,...
6167,OMIM,168600,"{Parkinson disease, age of onset, modifier}",300144,GLUD2,2747,4,GO:0006537 | GO:0006538 | GO:0055114 | GO:0006536,,
6212,OMIM,180300,"{Rheumatoid arthritis, progression of}",124092,IL10,3586,71,GO:0035722 | GO:0014823 | GO:0051091 | GO:0045...,,
6219,OMIM,604302,"{Rheumatoid arthritis, systemic juvenile, susc...",153620,MIF,4282,39,GO:0043066 | GO:0030336 | GO:0035722 | GO:0030...,,
6227,OMIM,181500,"{Schizophrenia, susceptibility to}",601525,CHI3L1,1116,19,GO:0034612 | GO:0030324 | GO:0051897 | GO:0070...,,


## Merge entries that have the same DB ID 
Entries with the same database ID are the same disease even if they have different disease names.

In [25]:
# Convert the gene_disease entries into strings:
# String values can be concatenated but integers cannot.
# This makes concatenating MIM Numbers and DB IDs less complicated.
gene_disease = gene_disease.astype(str)

def join(elements, sep = ' | '):
    '''Join list elements using the pipe character after removing
    duplicates and converting integers into strings.'''
    if sep:
        # Convert integer elements to strings (doesn't affect strings):
        # For example [1, 1, 2] into ['1', '1', '2']
        elements = set(map(str, elements))
        
        # Join the elements using the separator string: for example:
        # ['ID 1 | ID 1', 'ID 2'] into ['ID 1 | ID 1 | ID 2']
        elements = sep.join(elements)
        
        # Split elements using the separator and remove duplicates:
        # For example 'ID 1 | ID 1 | ID 2' into ('ID 1, ID 2')
        elements = set(elements.split(sep))
        
        # Rejoin the unique elements using the separator:
        # For example ('ID 1, ID 2') into 'ID 1 | ID 2'
        elements = sep.join(elements)
        
        # Return the joined elements.
        return elements
    
    # If not separator, return the original list.
    return elements

# Iterate thru every row in the no_duplicates file
for index, row1 in no_duplicates.iterrows():
    
    # Get gene symbol from row in no_duplicates file
    db_id = row1['DB ID']
    
    # Get diseases from duplicates file that have matching DB IDs
    disease = duplicates.at[db_id, 'Disease']
    
    # Get MIM numbers from duplicates file that have matching DB IDs.
    mim_num = duplicates.at[db_id, 'MIM Number']
    # Get gene symbols from duplicates file that have matching DB IDs.
    gene = duplicates.at[db_id, 'Gene Symbol']
    # Get gene IDs from duplicates file that have matching DB IDs.
    gene_id = duplicates.at[db_id, 'Gene ID']
    # Get GO-BP IDs from duplicates file that have matching DB IDs.
    go_ids = duplicates.at[db_id, 'GO-BP ID']
    
    # Count the number of GO-BP IDs found for the gene symbol.
    go_count = len(join(go_ids).split(' | '))
    
    # Join the list of diseases using the string ' | '.
    # Store the diseases in the 'Disease' column.
    # Specify the row number in the gene_disease table using index.
    # Do the same for mim_num, gene, gene_id, and go_ids.
    gene_disease.at[index, 'Disease'] = join(disease)
    gene_disease.at[index, 'MIM Number'] = join(mim_num)
    gene_disease.at[index, 'Gene Symbol'] = join(gene)
    gene_disease.at[index, 'Gene ID'] = join(gene_id)
    gene_disease.at[index, 'GO-BP ID'] = join(go_ids)
    
    # Store GO-BP ID count in 'GO-BP ID Count' column.
    gene_disease.at[index, 'GO-BP ID Count'] = go_count
    
# Remove entries from gene_disease file that have the same DB ID:
# Leave first instance of the DB ID and remove remaining duplicates.
gene_disease = gene_disease.drop_duplicates('DB ID')

# Reset index numbering so it is continuous, and drop previous index.
gene_disease = gene_disease.reset_index(drop = True)

## Display the result of merging diseases that have the same DB ID

In [26]:
# For visualization only: may delete code line.
gene_disease

Unnamed: 0,DB,DB ID,Disease,MIM Number,Gene Symbol,Gene ID,GO-BP ID Count,GO-BP ID,GO Definition Count,GO Definition
0,OMIM,202110,"17,20-lyase deficiency, isolated | 17-alpha-hy...",609300,CYP17A1,1586,8,GO:0006702 | GO:0042446 | GO:0006694 | GO:0008...,,
1,OMIM,616034,"2,4-dienoyl-CoA reductase deficiency",615787,NADK2,133686,3,GO:0019674 | GO:0006741 | GO:0016310,,
2,OMIM,204750,2-aminoadipic 2-oxoadipic aciduria,614984,DHTKD1,55526,3,GO:0006091 | GO:0002244 | GO:0006096,,
3,OMIM,610006,2-methylbutyrylglycinuria,600301,ACADSB,36,3,GO:0055114 | GO:0006631 | GO:0009083,,
4,OMIM,273750,3-M syndrome 1,609577,CUL7,9820,14,GO:0006511 | GO:0050775 | GO:0007030 | GO:0043...,,
...,...,...,...,...,...,...,...,...,...,...
5412,OMIM,606579,{Vitiligo-associated multiple autoimmune disea...,606636,NLRP1,22861,12,GO:0006919 | GO:0042742 | GO:0045087 | GO:0050...,,
5413,OMIM,610379,"{West nile virus, susceptibility to}",601373,CCR5,1234,21,GO:0007166 | GO:0014808 | GO:0071222 | GO:0006...,,
5414,OMIM,616806,"{Wilms tumor 6, susceptibility to}",600571,REST,5978,37,GO:0045892 | GO:0046676 | GO:1903204 | GO:2000...,,
5415,OMIM,601583,{Wilms tumor susceptibility-5},609062,POU6F2,11281,7,GO:0007402 | GO:0007417 | GO:0030154 | GO:0006...,,


#### Display gene ontology file (the following code segment is only used for visualization)

In [27]:
# Open the gene ontology file.
# Skip the first 28 lines and any blank lines.
g_ontology_file = pandas.read_csv(
    g_ontology, 
    skiprows=28,
    skip_blank_lines = True)

# Display the table.
g_ontology_file

Unnamed: 0,[Term]
0,id: GO:0000001
1,name: mitochondrion inheritance
2,namespace: biological_process
3,"def: ""The distribution of mitochondria, includ..."
4,"synonym: ""mitochondrial inheritance"" EXACT []"
...,...
558696,name: term tracker item
558697,namespace: external
558698,xref: IAO:0000233
558699,is_metadata_tag: true


## Create a dictionary of every GO-BP ID and its definition

In [28]:
# Create an empty dictionary of GO IDs and definitions.
go_dictionary = {}

# Definition has a GO ID only whenever a new GO term has been found.
has_id = False

# Iterate thru every line in the go.obo file.
for line in open(g_ontology):
    
    #Check if the line is a GO ID
    if line[0:3] == 'id:':
        
        # Key looks like GO:0000000 and always spans chars 4 thru 14.
        go_id = line[4:14]
        
        # The GO ID will serve as a key, so set flag to True.
        has_id = True

    # Check if the line is a GO definition and if it has a key.
    elif line[0:5] == '"def:' and has_id:

        # GO definition is surrounded by quotes and additional info.
        # Split string and extract only the definition.
        definition = line.split('"')[3]
        
        # Assign the definition to the GO ID.
        go_dictionary[go_id] = definition
        go_id = False  

#### Display gene ontology file (the following code segment is only used for visualization)

In [29]:
# For visualization only: may delete code line.
pandas.DataFrame.from_dict(go_dictionary, orient = 'index')

Unnamed: 0,0
GO:0000001,"The distribution of mitochondria, including th..."
GO:0000002,The maintenance of the structure and integrity...
GO:0000003,The production of new individuals that contain...
GO:0000005,OBSOLETE. Assists in the correct assembly of r...
GO:0000006,Enables the transfer of zinc ions (Zn2+) from ...
...,...
GO:2001313,The chemical reactions and pathways involving ...
GO:2001314,The chemical reactions and pathways resulting ...
GO:2001315,The chemical reactions and pathways resulting ...
GO:2001316,The chemical reactions and pathways involving ...


## Use the dictionary of GO-BP IDs and definitions to define the GO-BP IDs in the gene-disease table

In [30]:
# Iterate thru every row in gene_disease table.
for index, row in gene_disease.iterrows():
    
    # Obtain the list of GO-BP IDs from the gene_disease row.
    go_list = row['GO-BP ID']
    
    # Create an empty list where GO ID definitions will be stored.
    def_list = []
    
    # Create accumulator to keep track of the number of definitions.
    def_count = 0
    
    # Get each GO ID by splitting the string.
    go_list = go_list.split(' | ')
    
    # Iterate thru every GO ID in the list.
    for go_id in go_list:
        
        # Add the dictionary definition to definition list.
        def_list += [go_dictionary[go_id]]

        # Increase the definition count.
        def_count += 1
        
    # Define a separator to place between definitions.
    separator = ' | '
    
    # Convert the definition list into a string.
    def_list = separator.join(def_list)
    
    # Store definition count of the def_list in the 'GO Definition
    # Count' column and store definition list in the 'GO Definition'
    # column. Specify the row of the gene_disease table using index.
    gene_disease.at[index, 'GO Definition Count'] = def_count
    gene_disease.at[index, 'GO Definition'] = def_list

#### Display the result of using GO-BP IDs to find GO definitions

In [31]:
# Display the result of using GO-BP IDs to find GO definitions.
gene_disease

Unnamed: 0,DB,DB ID,Disease,MIM Number,Gene Symbol,Gene ID,GO-BP ID Count,GO-BP ID,GO Definition Count,GO Definition
0,OMIM,202110,"17,20-lyase deficiency, isolated | 17-alpha-hy...",609300,CYP17A1,1586,8,GO:0006702 | GO:0042446 | GO:0006694 | GO:0008...,8,The chemical reactions and pathways resulting ...
1,OMIM,616034,"2,4-dienoyl-CoA reductase deficiency",615787,NADK2,133686,3,GO:0019674 | GO:0006741 | GO:0016310,3,The chemical reactions and pathways involving ...
2,OMIM,204750,2-aminoadipic 2-oxoadipic aciduria,614984,DHTKD1,55526,3,GO:0006091 | GO:0002244 | GO:0006096,3,The chemical reactions and pathways resulting ...
3,OMIM,610006,2-methylbutyrylglycinuria,600301,ACADSB,36,3,GO:0055114 | GO:0006631 | GO:0009083,3,A metabolic process that results in the remova...
4,OMIM,273750,3-M syndrome 1,609577,CUL7,9820,14,GO:0006511 | GO:0050775 | GO:0007030 | GO:0043...,14,The chemical reactions and pathways resulting ...
...,...,...,...,...,...,...,...,...,...,...
5412,OMIM,606579,{Vitiligo-associated multiple autoimmune disea...,606636,NLRP1,22861,12,GO:0006919 | GO:0042742 | GO:0045087 | GO:0050...,12,Any process that initiates the activity of the...
5413,OMIM,610379,"{West nile virus, susceptibility to}",601373,CCR5,1234,21,GO:0007166 | GO:0014808 | GO:0071222 | GO:0006...,21,A series of molecular signals initiated by act...
5414,OMIM,616806,"{Wilms tumor 6, susceptibility to}",600571,REST,5978,37,GO:0045892 | GO:0046676 | GO:1903204 | GO:2000...,37,"Any process that stops, prevents, or reduces t..."
5415,OMIM,601583,{Wilms tumor susceptibility-5},609062,POU6F2,11281,7,GO:0007402 | GO:0007417 | GO:0030154 | GO:0006...,7,The cell fate determination process in which a...


## Remove BP term information

In [32]:
# Remove BP terms from gene_disease file.
no_bps = gene_disease.drop('GO Definition', axis = 1)

# Remove BP term count from gene_diease file.
no_bps = no_bps.drop('GO Definition Count', axis = 1)

#### Display result of removing BP term information

In [33]:
# For visualization only: may delete code line.
no_bps

Unnamed: 0,DB,DB ID,Disease,MIM Number,Gene Symbol,Gene ID,GO-BP ID Count,GO-BP ID
0,OMIM,202110,"17,20-lyase deficiency, isolated | 17-alpha-hy...",609300,CYP17A1,1586,8,GO:0006702 | GO:0042446 | GO:0006694 | GO:0008...
1,OMIM,616034,"2,4-dienoyl-CoA reductase deficiency",615787,NADK2,133686,3,GO:0019674 | GO:0006741 | GO:0016310
2,OMIM,204750,2-aminoadipic 2-oxoadipic aciduria,614984,DHTKD1,55526,3,GO:0006091 | GO:0002244 | GO:0006096
3,OMIM,610006,2-methylbutyrylglycinuria,600301,ACADSB,36,3,GO:0055114 | GO:0006631 | GO:0009083
4,OMIM,273750,3-M syndrome 1,609577,CUL7,9820,14,GO:0006511 | GO:0050775 | GO:0007030 | GO:0043...
...,...,...,...,...,...,...,...,...
5412,OMIM,606579,{Vitiligo-associated multiple autoimmune disea...,606636,NLRP1,22861,12,GO:0006919 | GO:0042742 | GO:0045087 | GO:0050...
5413,OMIM,610379,"{West nile virus, susceptibility to}",601373,CCR5,1234,21,GO:0007166 | GO:0014808 | GO:0071222 | GO:0006...
5414,OMIM,616806,"{Wilms tumor 6, susceptibility to}",600571,REST,5978,37,GO:0045892 | GO:0046676 | GO:1903204 | GO:2000...
5415,OMIM,601583,{Wilms tumor susceptibility-5},609062,POU6F2,11281,7,GO:0007402 | GO:0007417 | GO:0030154 | GO:0006...


## Remove GO-BP ID ancestors to narrow down the association between genes and biological processes

Consider the terms: 

    GO:0006694, GO:0006702, GO:0006704, GO:0007548, GO:0008202, GO:0042446, GO:0042448, GO:0055114

The following ancestor chart shows that 3 out of the 8 GO terms are redundant. The GO terms without ancestors are: 

    GO:0006702, GO:0042448, GO:0006704, GO:0007548, GO:0055114

<img src="Ancestor Chart.png" width="900" height="900">

### Remove the ancestors from every entry in the gene-disease associations file

In [34]:
# Split terms using pipe character and convert list into a set.
gene_disease['GO-BP ID'] = gene_disease['GO-BP ID'].apply(
    lambda terms: set(terms.split(' | ')))

# Apply the remove_ancestors function to the 'GO-BP ID' column.
gene_disease['GO-BP ID'] = gene_disease['GO-BP ID'].apply(
    lambda terms:remove_ancestors(terms))

# Join the terms using the pipe character.
gene_disease['GO-BP ID'] = gene_disease['GO-BP ID'].apply(
    lambda terms: (' | ').join(terms))

### Update the GO ID count

In [35]:
# Apply the count_elements function to the 'GO-BP ID' column
# to update the 'GO-BP ID Count' column.
gene_disease['GO-BP ID Count'] = gene_disease['GO-BP ID'].apply(
    count_elements, sep = '|')

### Update the 'GO Definition' and 'GO Definition Count' columns

In [36]:
# Iterate thru every row in gene_disease table.
for index, row in gene_disease.iterrows():
    
    # Obtain the list of GO-BP IDs from the gene_disease row.
    go_list = row['GO-BP ID']
    
    # Create an empty list where GO ID definitions will be stored.
    def_list = []
    
    # Create accumulator to keep track of the number of definitions.
    def_count = 0
    
    # Get each GO ID by splitting the string.
    go_list = go_list.split(' | ')
    
    # Iterate thru every GO ID in the list.
    for go_id in go_list:
        
        # Add the dictionary definition to definition list.
        def_list += [go_dictionary[go_id]]

        # Increase the definition count.
        def_count += 1
        
    # Define a separator to place between definitions.
    separator = ' | '
    
    # Convert the definition list into a string.
    def_list = separator.join(def_list)
    
    # Store definition count of the def_list in the 'GO Definition 
    # Count' column and store definition list in 'GO Definition'
    # column. Specify the row of the gene_disease table using index.
    gene_disease.at[index, 'GO Definition Count'] = def_count
    gene_disease.at[index, 'GO Definition'] = def_list

#### Display the result of removing GO ID ancestors

In [37]:
# Display the result of removing GO ID ancestors
gene_disease

Unnamed: 0,DB,DB ID,Disease,MIM Number,Gene Symbol,Gene ID,GO-BP ID Count,GO-BP ID,GO Definition Count,GO Definition
0,OMIM,202110,"17,20-lyase deficiency, isolated | 17-alpha-hy...",609300,CYP17A1,1586,5,GO:0006702 | GO:0007548 | GO:0042448 | GO:0055...,5,The chemical reactions and pathways resulting ...
1,OMIM,616034,"2,4-dienoyl-CoA reductase deficiency",615787,NADK2,133686,3,GO:0019674 | GO:0006741 | GO:0016310,3,The chemical reactions and pathways involving ...
2,OMIM,204750,2-aminoadipic 2-oxoadipic aciduria,614984,DHTKD1,55526,2,GO:0002244 | GO:0006096,2,The process in which precursor cell type acqui...
3,OMIM,610006,2-methylbutyrylglycinuria,600301,ACADSB,36,3,GO:0055114 | GO:0006631 | GO:0009083,3,A metabolic process that results in the remova...
4,OMIM,273750,3-M syndrome 1,609577,CUL7,9820,13,GO:0006511 | GO:0050775 | GO:0007030 | GO:0043...,13,The chemical reactions and pathways resulting ...
...,...,...,...,...,...,...,...,...,...,...
5412,OMIM,606579,{Vitiligo-associated multiple autoimmune disea...,606636,NLRP1,22861,10,GO:0006919 | GO:0050718 | GO:0042742 | GO:0045...,10,Any process that initiates the activity of the...
5413,OMIM,610379,"{West nile virus, susceptibility to}",601373,CCR5,1234,14,GO:0007267 | GO:0071222 | GO:0014808 | GO:0006...,14,Any process that mediates the transfer of info...
5414,OMIM,616806,"{Wilms tumor 6, susceptibility to}",600571,REST,5978,31,GO:0045892 | GO:0046676 | GO:1903204 | GO:0000...,31,"Any process that stops, prevents, or reduces t..."
5415,OMIM,601583,{Wilms tumor susceptibility-5},609062,POU6F2,11281,7,GO:0006366 | GO:0045944 | GO:0007402 | GO:0007...,7,The synthesis of RNA from a DNA template by RN...


## Remove BP term information

In [38]:
# Remove BP terms from gene_disease file.
no_bps = gene_disease.drop('GO Definition', axis = 1)

# Remove BP term count from gene_diease file.
no_bps = no_bps.drop('GO Definition Count', axis = 1)

#### Display result of removing BP term information

In [39]:
# For visualization only: may delete code line.
no_bps

Unnamed: 0,DB,DB ID,Disease,MIM Number,Gene Symbol,Gene ID,GO-BP ID Count,GO-BP ID
0,OMIM,202110,"17,20-lyase deficiency, isolated | 17-alpha-hy...",609300,CYP17A1,1586,5,GO:0006702 | GO:0007548 | GO:0042448 | GO:0055...
1,OMIM,616034,"2,4-dienoyl-CoA reductase deficiency",615787,NADK2,133686,3,GO:0019674 | GO:0006741 | GO:0016310
2,OMIM,204750,2-aminoadipic 2-oxoadipic aciduria,614984,DHTKD1,55526,2,GO:0002244 | GO:0006096
3,OMIM,610006,2-methylbutyrylglycinuria,600301,ACADSB,36,3,GO:0055114 | GO:0006631 | GO:0009083
4,OMIM,273750,3-M syndrome 1,609577,CUL7,9820,13,GO:0006511 | GO:0050775 | GO:0007030 | GO:0043...
...,...,...,...,...,...,...,...,...
5412,OMIM,606579,{Vitiligo-associated multiple autoimmune disea...,606636,NLRP1,22861,10,GO:0006919 | GO:0050718 | GO:0042742 | GO:0045...
5413,OMIM,610379,"{West nile virus, susceptibility to}",601373,CCR5,1234,14,GO:0007267 | GO:0071222 | GO:0014808 | GO:0006...
5414,OMIM,616806,"{Wilms tumor 6, susceptibility to}",600571,REST,5978,31,GO:0045892 | GO:0046676 | GO:1903204 | GO:0000...
5415,OMIM,601583,{Wilms tumor susceptibility-5},609062,POU6F2,11281,7,GO:0006366 | GO:0045944 | GO:0007402 | GO:0007...


## Save gene-disease associations file as .csv file (no GO ID ancestors)

Save the file after the resource-intensive process of replacing Gene ID values.

In [40]:
# Specify the filename.
filename = 'Gene-Disease Associations, No GO ID Ancestors.csv'

# Make index=False so that columns are not numbered 1,2,3 through n.
no_bps.to_csv(filename, index=False)

## Add the ancestors to every entry in the gene-disease associations file

In [41]:
# Rename the No BP Ancestors table since 
# all the GO ID ancestors will be added back.
all_bps = no_bps

# Apply the remove_ancestors function to the 'GO-BP ID' column.
all_bps['GO-BP ID'] = all_bps['GO-BP ID'].apply(
    get_all_ancestors, sep = ' | ', join_sep=' | ')

#### Display the result of adding all the GO ID ancestors

In [42]:
# For visualization only: may delete code line.
all_bps

Unnamed: 0,DB,DB ID,Disease,MIM Number,Gene Symbol,Gene ID,GO-BP ID Count,GO-BP ID
0,OMIM,202110,"17,20-lyase deficiency, isolated | 17-alpha-hy...",609300,CYP17A1,1586,5,GO:0042446 | GO:0008150 | GO:0065007 | GO:0006...
1,OMIM,616034,"2,4-dienoyl-CoA reductase deficiency",615787,NADK2,133686,3,GO:0009117 | GO:0008150 | GO:0072524 | GO:0006...
2,OMIM,204750,2-aminoadipic 2-oxoadipic aciduria,614984,DHTKD1,55526,2,GO:0009117 | GO:0008150 | GO:0006753 | GO:0006...
3,OMIM,610006,2-methylbutyrylglycinuria,600301,ACADSB,36,3,GO:0016054 | GO:0008150 | GO:0071704 | GO:0008...
4,OMIM,273750,3-M syndrome 1,609577,CUL7,9820,13,GO:0007010 | GO:0065007 | GO:0031346 | GO:0007...
...,...,...,...,...,...,...,...,...
5412,OMIM,606579,{Vitiligo-associated multiple autoimmune disea...,606636,NLRP1,22861,10,GO:0070201 | GO:0065007 | GO:0050716 | GO:0051...
5413,OMIM,610379,"{West nile virus, susceptibility to}",601373,CCR5,1234,14,GO:0044800 | GO:0061024 | GO:0065007 | GO:0039...
5414,OMIM,616806,"{Wilms tumor 6, susceptibility to}",600571,REST,5978,31,GO:0070201 | GO:2000740 | GO:0000122 | GO:0071...
5415,OMIM,601583,{Wilms tumor susceptibility-5},609062,POU6F2,11281,7,GO:2001141 | GO:0080090 | GO:0008150 | GO:0065...


## Update the GO ID count

In [43]:
# Apply the count_separator function to the 'GO-BP ID' column
# to update the 'GO-BP ID Count' column.
all_bps['GO-BP ID Count'] = all_bps['GO-BP ID'].apply(
    count_elements, sep = '|')

## Remove entries that do not contain the root node GO:0008150 because they are erroneous

The Gene Ontology does not seem to contain the information necessary to connect a couple of biological processes to the root GO:0008150.

In [44]:
# Keep entries where the column 'GO-BP ID' contains the root node.
all_bps = all_bps[all_bps['GO-BP ID'].str.contains('GO:0008150')]

#### Display the result of updating the GO ID count and removing entries without a root node

In [45]:
# For visualization only: may delete code line.
all_bps

Unnamed: 0,DB,DB ID,Disease,MIM Number,Gene Symbol,Gene ID,GO-BP ID Count,GO-BP ID
0,OMIM,202110,"17,20-lyase deficiency, isolated | 17-alpha-hy...",609300,CYP17A1,1586,35,GO:0042446 | GO:0008150 | GO:0065007 | GO:0006...
1,OMIM,616034,"2,4-dienoyl-CoA reductase deficiency",615787,NADK2,133686,42,GO:0009117 | GO:0008150 | GO:0072524 | GO:0006...
2,OMIM,204750,2-aminoadipic 2-oxoadipic aciduria,614984,DHTKD1,55526,51,GO:0009117 | GO:0008150 | GO:0006753 | GO:0006...
3,OMIM,610006,2-methylbutyrylglycinuria,600301,ACADSB,36,28,GO:0016054 | GO:0008150 | GO:0071704 | GO:0008...
4,OMIM,273750,3-M syndrome 1,609577,CUL7,9820,105,GO:0007010 | GO:0065007 | GO:0031346 | GO:0007...
...,...,...,...,...,...,...,...,...
5412,OMIM,606579,{Vitiligo-associated multiple autoimmune disea...,606636,NLRP1,22861,120,GO:0070201 | GO:0065007 | GO:0050716 | GO:0051...
5413,OMIM,610379,"{West nile virus, susceptibility to}",601373,CCR5,1234,143,GO:0044800 | GO:0061024 | GO:0065007 | GO:0039...
5414,OMIM,616806,"{Wilms tumor 6, susceptibility to}",600571,REST,5978,287,GO:0070201 | GO:2000740 | GO:0000122 | GO:0071...
5415,OMIM,601583,{Wilms tumor susceptibility-5},609062,POU6F2,11281,38,GO:2001141 | GO:0080090 | GO:0008150 | GO:0065...


## Sort by GO-BP ID count

In [46]:
# Sort the table in descending order.
all_bps = all_bps.sort_values(by = ['GO-BP ID Count'], 
                              ascending = False)

## Save gene-disease associations file as .csv file (All GO IDs)

In [47]:
# Specify the filename.
filename = 'Gene-Disease Associations, All GO IDs.csv'

# Make index=True so that columns are not dropped.
all_bps.to_csv(filename, index=True)