**Guide to parsing OBO ontology data into a tab-delimited file**
1. Download the gene ontology file with `wget http://current.geneontology.org/ontology/go.obo`.
2. Delete the header lines at the top of the file so that `[Term]` is the first line. Otherwise `parse_obo` treats the header as the first entry and takes the column names from it.
3. Use the `parse_obo` function to parse the ontology data into a tab-delimited file. If you download this notebook and have the `go.obo` file in the same directory, you can run the following cells to produce the tab-delimited file:

In [1]:
# Define a function to parse the text file and save it as a tab-delimited file
def parse_obo(filename, output_filename):
    data = []
    current_entry = {}

    with open(filename, 'r') as file:
        for line in file:
            line = line.strip()
            if line.startswith('['):
                # If a new entry starts, save the current entry and start a new one
                if current_entry:
                    data.append(current_entry)
                current_entry = {}
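            elif ':' in line:
                # Split the line into key and value at the first colon.
                # Lines without a colon are skipped; a repeated key
                # (e.g. is_a) keeps only its last value.
                key, value = map(str.strip, line.split(':', 1))
                current_entry[key] = value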

    # Append the last entry
    if current_entry:
        data.append(current_entry)

    # Save the data as a tab-delimited file
    with open(output_filename, 'w') as output_file:
        # Write the column headers (keys from the first entry)
        if data:
            headers = data[0].keys()
            output_file.write('\t'.join(headers) + '\n')

            # Write the data
            for entry in data:
                values = [entry.get(header, '') for header in headers]
                output_file.write('\t'.join(values) + '\n')
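
Step 2 (trimming the header by hand) could also be scripted. The following is a minimal sketch, not part of the original workflow: the helper name `strip_obo_header` and its `marker` parameter are hypothetical, and it assumes the first stanza line reads exactly `[Term]`.

```python
def strip_obo_header(src_path, dst_path, marker='[Term]'):
    """Copy src_path to dst_path, dropping every line before the first marker line.

    Returns the number of header lines that were dropped.
    (Hypothetical helper; the marker default assumes the file's first stanza is [Term].)
    """
    dropped = 0
    found = False
    with open(src_path, 'r') as src, open(dst_path, 'w') as dst:
        for line in src:
            if not found:
                if line.strip() == marker:
                    found = True      # from here on, copy everything
                else:
                    dropped += 1      # still inside the header
                    continue
            dst.write(line)
    return dropped
```

For example, `strip_obo_header('go_raw.obo', 'go.obo')` would write a trimmed copy, where `go_raw.obo` is a placeholder name for the untouched download.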


Initial data import (2023-09-29)

In [None]:
# Call the function with your file names
input_filename = 'go.obo'
output_filename = 'go_temp.txt'
parse_obo(input_filename, output_filename)

Never annotate list (2024-03-28)

In [None]:
# Call the function with your file names
input_filename = 'gocheck_do_not_annotate.obo'
output_filename = 'gocheck_do_not_annotate_temp.txt'
parse_obo(input_filename, output_filename)

Update (2024-07-17)

In [7]:
# Call the function on the files downloaded with wget from the official website:
input_filename = 'go_2024-07-17.obo'
output_filename = 'go_2024-07-17_temp.txt'
parse_obo(input_filename, output_filename)

input_filename = 'gocheck_do_not_annotate_2024-07-17.obo'
output_filename = 'gocheck_do_not_annotate_2024-07-17_temp.txt'
parse_obo(input_filename, output_filename)
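
`parse_obo` stores one value per key, so a tag that repeats within a stanza (such as `is_a`) keeps only its last value, and columns are taken from the first entry alone. If that matters for your import, a variant along the following lines may help. This is a sketch under assumptions, not the notebook's method: the name `parse_obo_multi` and the `|` separator are hypothetical choices, and it assumes `|` does not occur in the values you care about. It also skips any lines before the first stanza, so the header does not need to be removed first.

```python
def parse_obo_multi(filename, output_filename, sep='|'):
    """Parse OBO stanzas into a tab-delimited file, joining repeated tags with sep.

    Hypothetical variant of parse_obo. Returns the list of column headers.
    """
    entries = []
    current = None  # None until the first stanza line, so header lines are ignored
    with open(filename, 'r') as infile:
        for raw in infile:
            line = raw.strip()
            if line.startswith('['):
                if current:
                    entries.append(current)
                current = {}
            elif ':' in line and current is not None:
                key, value = map(str.strip, line.split(':', 1))
                # Collect every value seen for this key instead of overwriting
                current.setdefault(key, []).append(value)
    if current:
        entries.append(current)

    # Columns are the union of all keys, in order of first appearance
    headers = []
    for entry in entries:
        for key in entry:
            if key not in headers:
                headers.append(key)

    with open(output_filename, 'w') as outfile:
        outfile.write('\t'.join(headers) + '\n')
        for entry in entries:
            row = [sep.join(entry.get(h, [])) for h in headers]
            outfile.write('\t'.join(row) + '\n')
    return headers
```

The output still contains any double quotes present in the source values, so the quote-removal step applies to it in the same way.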

4. Neo4j's CSV import treats double quotes as field-quoting characters and fails when text follows a closing quote, so remove all double quotes from the file with the following cells:

In [3]:
def remove_quotation_marks(input_file, output_file):
    """
    Removes double quotation marks from a text file and saves the result to another file.

    Args:
        input_file (str): Path to the input text file.
        output_file (str): Path to the output text file.
    """
    try:
        with open(input_file, 'r') as infile, open(output_file, 'w') as outfile:
            for line in infile:
                modified_line = line.replace('"', '')
                outfile.write(modified_line)
        print(f'Quotation marks removed from {input_file} and saved to {output_file}')
    except FileNotFoundError:
        print(f'Error: File not found - {input_file}')

In [5]:
import os
# Example usage:
input_file_path = 'go_temp.txt'  # Replace with the path to your input file.
output_file_path = 'go.txt'  # Replace with the path where you want to save the output.
remove_quotation_marks(input_file_path, output_file_path)
os.remove(input_file_path)

Quotation marks removed from go_temp.txt and saved to go.txt


In [None]:
import os
# Example usage:
input_file_path = 'gocheck_do_not_annotate_temp.txt'  # Replace with the path to your input file.
output_file_path = 'gocheck_do_not_annotate.txt'  # Replace with the path where you want to save the output.
remove_quotation_marks(input_file_path, output_file_path)
os.remove(input_file_path)

In [8]:
import os
# Remove quotes:
input_file_path = 'go_2024-07-17_temp.txt'  # Replace with the path to your input file.
output_file_path = 'go_2024-07-17.txt'  # Replace with the path where you want to save the output.
remove_quotation_marks(input_file_path, output_file_path)
os.remove(input_file_path)

input_file_path = 'gocheck_do_not_annotate_2024-07-17_temp.txt'  # Replace with the path to your input file.
output_file_path = 'gocheck_do_not_annotate_2024-07-17.txt'  # Replace with the path where you want to save the output.
remove_quotation_marks(input_file_path, output_file_path)
os.remove(input_file_path)

Quotation marks removed from go_2024-07-17_temp.txt and saved to go_2024-07-17.txt
Quotation marks removed from gocheck_do_not_annotate_2024-07-17_temp.txt and saved to gocheck_do_not_annotate_2024-07-17.txt


5. Remove the last few lines of the `go.txt` file, whose format differs from that of the preceding lines. The final `go.txt` file should contain 47596 lines in total.
6. Place the `go.txt` file into the `~/neo4j/import/` directory and continue with the `DataImportTutorial.md` guide.
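
Step 5 can be checked or scripted instead of edited by hand. The sketch below rests on an assumption the guide does not state: that the lines to drop are exactly those whose first column does not start with `GO:`, and that the first column holds the term `id`. Inspect the tail of your file before relying on it. The helper name `keep_rows_with_prefix` is hypothetical.

```python
def keep_rows_with_prefix(src_path, dst_path, prefix='GO:'):
    """Copy the header row plus every row whose first column starts with prefix.

    Hypothetical helper. Returns the number of lines written, header included,
    so the result can be compared with the expected line count.
    """
    written = 0
    with open(src_path, 'r') as src, open(dst_path, 'w') as dst:
        for index, line in enumerate(src):
            first_column = line.split('\t', 1)[0]
            # Line 0 is the header row; keep it unconditionally
            if index == 0 or first_column.startswith(prefix):
                dst.write(line)
                written += 1
    return written
```

The return value can then be compared with the line count given in step 5 before the file is moved into the import directory.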