**GO Relationship Import**

**Guide to parsing OBO ontology data into a tab-delimited file**
1. Download the Gene Ontology file with `wget http://current.geneontology.org/ontology/go-basic.obo`.
2. Delete the header lines at the top of the file so that "[Term]" is the first line.
3. Use the following code to parse the ontology data into a tab-delimited file. If you download this notebook and place `go-basic.obo` in the same directory, you can run the cells below as-is to produce the output files.
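Step 2 above can also be done in code rather than by hand. The following is a minimal sketch, not part of the original notebook: the helper name `strip_obo_header` and the sample lines are hypothetical, and it assumes the first stanza header in the file is a line containing exactly `[Term]`.

```python
def strip_obo_header(lines):
    """Return the lines starting at the first "[Term]" stanza header."""
    for index, line in enumerate(lines):
        if line.strip() == "[Term]":
            return lines[index:]
    return []  # no [Term] stanza found


# Made-up sample mimicking the top of an OBO file
sample = [
    "format-version: 1.2\n",
    "ontology: go\n",
    "\n",
    "[Term]\n",
    "id: GO:0000001\n",
]
trimmed = strip_obo_header(sample)
print(trimmed[0].strip())  # [Term]
```

For the real file, the same function could be applied to the result of `input_file.readlines()` and the trimmed lines written back out.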

First, get the "is_a" relationships.

In [1]:
# Open the input file for reading
with open('go-basic.obo', 'r') as input_file:
    # Open the tab-delimited output file that the next cell reads
    with open('go_is_a.tsv', 'w') as output_file:
        # Write header: the term id, then the parent id from its is_a line
        output_file.write("id\tid2\n")

        # Initialize variables to store id and relationships
        current_id = ""
        is_a = []

        # Iterate through each line in the input file
        for line in input_file:
            line = line.strip()

            # Check if the line starts with "[Term]"
            if line.startswith("[Term]"):
                # If a previous term had relationships, write them to the output file
                if current_id and is_a:
                    for i in is_a:
                        output_file.write(f"{current_id}\t{i}\n")

                # Reset variables for the new term
                current_id = ""
                is_a = []
            elif line.startswith("id:"):
                current_id = line.replace("id: ", "")
            elif line.startswith("is_a:"):
                is_a.append(line.split("is_a:")[1].strip().split()[0])

        # Write the last term's relationships if any
        if current_id and is_a:
            for i in is_a:
                output_file.write(f"{current_id}\t{i}\n")


Next, add a constant `is_a` column that labels the relationship type of every row, which eases import into Neo4j.

In [2]:
import pandas as pd

# Read the tab-delimited is_a file into a DataFrame
file_path = 'go_is_a.tsv'
df = pd.read_csv(file_path, sep='\t')

# Add a constant column holding the relationship type
df['is_a'] = 'is_a'

# Save the modified DataFrame to a new file
output_file_path = 'is_a_import.tsv'
df.to_csv(output_file_path, sep='\t', index=False)

# Display the DataFrame
print(df)

                      id         id2  is_a
0             GO:0000001  GO:0048308  is_a
1             GO:0000001  GO:0048311  is_a
2             GO:0000002  GO:0007005  is_a
3             GO:0000003  GO:0008150  is_a
4             GO:0000006  GO:0005385  is_a
...                  ...         ...   ...
68311  term_tracker_item  GO:0120255  is_a
68312  term_tracker_item  GO:1901362  is_a
68313  term_tracker_item  GO:2001316  is_a
68314  term_tracker_item   regulates  is_a
68315  term_tracker_item   regulates  is_a

[68316 rows x 3 columns]


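Delete the trailing rows whose `id` is "term_tracker_item" (the last 8 lines of the output in this run). They appear because the parser only resets its state on "[Term]": `is_a` lines from the final term and from the "[Typedef]" stanzas at the end of `go-basic.obo` are all written under the last `id:` value read, which is `term_tracker_item`. The GO IDs in these rows most likely belong to the final term, so deleting the rows also drops that term's `is_a` edges.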
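The stray rows could also be avoided at parse time by tracking which kind of stanza is being read. The following is a minimal sketch under that assumption, not the notebook's original method: the function name `extract_is_a` and the sample text are illustrative only.

```python
def extract_is_a(lines):
    """Collect (term_id, parent_id) pairs from [Term] stanzas only."""
    rows = []
    in_term = False
    current_id = ""
    for raw in lines:
        line = raw.strip()
        if line.startswith("["):
            # Any stanza header starts a new block; only [Term] is kept
            in_term = line == "[Term]"
            current_id = ""
        elif not in_term:
            continue  # skip everything inside [Typedef] and other stanzas
        elif line.startswith("id:"):
            current_id = line[len("id:"):].strip()
        elif line.startswith("is_a:") and current_id:
            # Keep only the parent id, dropping the "! name" comment
            rows.append((current_id, line[len("is_a:"):].split()[0]))
    return rows


# Illustrative sample: one term followed by one typedef
sample = """[Term]
id: GO:0000001
is_a: GO:0048308 ! organelle inheritance

[Typedef]
id: negatively_regulates
is_a: regulates ! regulates
""".splitlines()

print(extract_is_a(sample))  # [('GO:0000001', 'GO:0048308')]
```

Because the `[Typedef]` block is skipped entirely, no row is attributed to a typedef id and no manual deletion is needed.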

Now extract the relationship data between GO terms.

In [1]:
# Open the input file for reading
with open('go-basic.obo', 'r') as input_file:
    # Open a file for writing the tab-delimited rows
    with open('go-relationship.tsv', 'w') as output_file:
        # Write header
        output_file.write("id\trelationship\n")

        # Initialize variables to store id and relationships
        current_id = ""
        relationships = []

        # Iterate through each line in the input file
        for line in input_file:
            line = line.strip()

            # Check if the line starts with "[Term]"
            if line.startswith("[Term]"):
                # If a previous term had relationships, write them to the output file
                if current_id and relationships:
                    for relationship in relationships:
                        output_file.write(f"{current_id}\t{relationship}\n")

                # Reset variables for the new term
                current_id = ""
                relationships = []
            elif line.startswith("id:"):
                current_id = line.replace("id: ", "")
            elif line.startswith("relationship:"):
                relationships.append(line.split("relationship:")[1].strip())

        # Write the last term's relationships if any
        if current_id and relationships:
            for relationship in relationships:
                output_file.write(f"{current_id}\t{relationship}\n")


Split each raw relationship string from the new tab-delimited file into the relationship type and the target GO ID, discarding the trailing term name.

In [3]:
import pandas as pd

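# Read the raw relationship rows, e.g. "part_of GO:0005829 ! cytosol"
file_path = 'go-relationship.tsv'
df = pd.read_csv(file_path, sep='\t')

# The second space-separated token is the target GO id
df['id2'] = df['relationship'].str.split(' ').str[1]

# The first token is the relationship type; overwrite the raw string last
df['relationship'] = df['relationship'].str.split(' ').str[0]

# Save the three-column table for import
output_file_path = 'relationship_import.tsv'
df.to_csv(output_file_path, sep='\t', index=False)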

# Display the DataFrame
print(df)

               id          relationship         id2
0      GO:0000015               part_of  GO:0005829
1      GO:0000018             regulates  GO:0006310
2      GO:0000019             regulates  GO:0006312
3      GO:0000022               part_of  GO:0000070
4      GO:0000022               part_of  GO:0007052
...           ...                   ...         ...
15262  GO:2001284             regulates  GO:0038055
15263  GO:2001285  negatively_regulates  GO:0038055
15264  GO:2001286             regulates  GO:0072584
15265  GO:2001287  negatively_regulates  GO:0072584
15266  GO:2001288  positively_regulates  GO:0072584

[15267 rows x 3 columns]
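To make the split above concrete, here is a small standalone sketch of the same idea applied to a single value, without pandas. The helper name `parse_relationship` is hypothetical, and it assumes each value has the form `type target ! name`, as in the rows shown above.

```python
def parse_relationship(value):
    """Split 'part_of GO:0005829 ! cytosol' into ('part_of', 'GO:0005829')."""
    # Drop the optional trailing "! name" comment, then split on whitespace
    body = value.split("!", 1)[0]
    parts = body.split()
    if len(parts) < 2:
        raise ValueError(f"unexpected relationship value: {value!r}")
    return parts[0], parts[1]


print(parse_relationship("part_of GO:0005829 ! cytosol"))
# ('part_of', 'GO:0005829')
```

Raising on malformed values, rather than silently producing a missing `id2`, makes unexpected lines visible before they reach the import file.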
