### Preprocessing of BioPAX files

This step needs to:
- concatenate independent pathway files into a single OWL file
- correct invalid URIs:
    - URIs that contain inner spaces
    - URIs that contain brackets
    - URIs that end with spaces or tabs
- write a corrected version of each file
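
The three kinds of invalid URI listed above can be told apart with a few regular expressions. The snippet below is only a minimal sketch using the standard library: `classify_uri` is a hypothetical helper name, and the third example string is made up for illustration. It also prints what `urllib.parse.quote` produces with `safe=":/#"`, the same setting used for brackets later in this notebook.

```python
import re
from urllib.parse import quote

def classify_uri(value):
    """Return the list of problems found in a URI string (empty if none)."""
    issues = []
    stripped = value.rstrip(" \t")
    if value != stripped:
        issues.append("trailing whitespace")
    if re.search(r"[ \t]", stripped):
        issues.append("inner whitespace")
    if re.search(r"[\[\]]", value):
        issues.append("brackets")
    return issues

examples = [
    "https://pantherdb.org/pathways/biopax/P06664#_[GnRH_GnRHR]_gnas_s808_csa54_",
    "#Reference/KEGG_Compound_C05543\t",
    "#some id with spaces",  # hypothetical value, for illustration only
]
for uri in examples:
    # quote() percent-encodes everything except letters, digits, "_.-~" and the safe set
    print(classify_uri(uri), "->", quote(uri, safe=":/#"))
```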

In [9]:
import os
import glob
from requests.utils import requote_uri
from urllib.parse import quote
import re
import shutil

### Problems in BioPAX files

- Panther (standalone BioPAX export and PathwayCommons export): brackets in URI, file `Gonadotropin_releasing_hormone_receptor_pathway.owl`
`ERROR riot            :: [line: 19044, col: 53] {W002} <https://pantherdb.org/pathways/biopax/P06664#_[GnRH_GnRHR]_gnas_s808_csa54_> Code: 0/ILLEGAL_CHARACTER in FRAGMENT: The character violates the grammar rules for URIs/IRIs`

- PathBank (PathwayCommons export): double whitespace in URI, files `PW002037.owl` and `PW123534.owl`

- KEGG (PathwayCommons export): `.translated` files from PathwayCommons are not supported for loading into the SPARQL endpoint. They need to be converted to `.owl`.

- NetPath (PathwayCommons export): 
    - `NetPath_7.owl` : whitespace
    - `NetPath_11.owl`: double whitespace
    - `NetPath_22.owl`: whitespace
    - `NetPath_9.owl` : brackets 
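
The problem files listed above were reported by the RDF parser. A quick pre-check in Python can locate the same kind of issue before loading. The sketch below assumes the files are RDF/XML that use the conventional `rdf:` prefix on `about`, `ID`, `resource` and `datatype` attributes; `find_suspect_uris` is a hypothetical helper, not part of any library. Restricting the search to these attributes avoids flagging ordinary quoted text.

```python
import re

# Only look at the RDF/XML attributes that carry URIs (assumes the "rdf:" prefix).
RDF_ATTR = re.compile(r'rdf:(?:about|ID|resource|datatype)="([^"]*)"')
# Characters that are not allowed unescaped in a URI: whitespace and square brackets.
BAD_CHARS = re.compile(r'[\s\[\]]')

def find_suspect_uris(path):
    """Yield (line_number, value) for URI attributes with whitespace or brackets."""
    with open(path, encoding="utf-8") as handle:
        for line_num, line in enumerate(handle, 1):
            for value in RDF_ATTR.findall(line):
                if BAD_CHARS.search(value):
                    yield line_num, value
```

Calling `list(find_suspect_uris("some_file.owl"))` on a clean file returns an empty list, so the function can be used to decide which files need the correction step.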



#### Function to correct URIs in invalid BioPAX files

In [10]:
def correct_invalid_uris(owl_file, output_file):
    """
    Reads an OWL file, identifies invalid URIs, corrects them conditionally, and writes a new file.

    Every double-quoted string on a line is treated as a candidate URI.

    - If a URI contains whitespace or tabulation, it is corrected using requote_uri.
    - Otherwise, if a URI contains brackets ([]), it is corrected using urllib.parse.quote.

    Arguments:
        owl_file (str): Path to the OWL file to check for invalid URIs.
        output_file (str): Path to write the corrected OWL file.
    """
    
    print(" ")
    print(str(owl_file))

    # Regular expression to find strings between double quotes
    quote_pattern = re.compile(r'"([^"]*)"')
    
    # Define different sets of invalid characters
    whitespace_or_tab = r'[\s]'  # matches any whitespace character (spaces, tabs, ...)
    brackets = r'[\[\]]'  # matches square brackets
    
    with open(owl_file, 'r') as infile:
        lines = infile.readlines()

    corrected_lines = []
    invalid_uris = []

    # Iterate through each line of the file to find strings between quotes
    for line_num, line in enumerate(lines, 1):
        # Find all strings between quotes
        quoted_strings = quote_pattern.findall(line)
        corrected_line = line

        for uri_value in quoted_strings:
            corrected_uri = uri_value  # Initially assume the URI is valid

            # Check if the URI contains whitespace or tab characters
            if re.search(whitespace_or_tab, uri_value):
                invalid_uris.append((line_num, uri_value, 'whitespace or tab'))
                # Correct the URI using requote_uri (for spaces/tabs)
                corrected_uri = requote_uri(uri_value)
            
            # Check if the URI contains square brackets
            elif re.search(brackets, uri_value):
                invalid_uris.append((line_num, uri_value, 'brackets'))
                # Correct the URI using quote (for brackets)
                corrected_uri = quote(uri_value, safe=":/#")

            # If the URI was corrected, replace it in the line
            if corrected_uri != uri_value:
                corrected_line = corrected_line.replace(f'"{uri_value}"', f'"{corrected_uri}"')

        # Add the corrected line to the list
        corrected_lines.append(corrected_line)
    
    # Write the corrected lines to a new output OWL file
    with open(output_file, 'w') as outfile:
        outfile.writelines(corrected_lines)

    # Print the invalid URIs and their corrections
    if invalid_uris:
        print("Invalid URIs found and corrected:")
        for line_num, uri, issue in invalid_uris:
            corrected = requote_uri(uri) if issue == 'whitespace or tab' else quote(uri, safe=":/#")
            print(f"Line {line_num}: {uri} -> {corrected} (issue: {issue})")
    else:
        print("No invalid URIs found.")



# Correct Panther files
correct_invalid_uris("BioPAX_Data/PantherBioPAX/BioPAX/Gonadotropin_releasing_hormone_receptor_pathway.owl", "BioPAX_Data/PantherBioPAX/BioPAX/Gonadotropin_releasing_hormone_receptor_pathway_corrected.owl")
correct_invalid_uris("BioPAX_Data/PathwayCommons/panther_pc/Gonadotropin_releasing_hormone_receptor_pathway.owl", "BioPAX_Data/PathwayCommons/panther_pc/Gonadotropin_releasing_hormone_receptor_pathway_corrected.owl")

# Correct PathBank files
correct_invalid_uris("BioPAX_Data/PathwayCommons/pathbank_pc/PW002037.owl", "BioPAX_Data/PathwayCommons/pathbank_pc/PW002037_corrected.owl")
correct_invalid_uris("BioPAX_Data/PathwayCommons/pathbank_pc/PW123534.owl", "BioPAX_Data/PathwayCommons/pathbank_pc/PW123534_corrected.owl")

# Correct NetPath files
correct_invalid_uris("BioPAX_Data/PathwayCommons/netpath_pc/NetPath_7.owl", "BioPAX_Data/PathwayCommons/netpath_pc/NetPath_7_corrected.owl")
correct_invalid_uris("BioPAX_Data/PathwayCommons/netpath_pc/NetPath_11.owl", "BioPAX_Data/PathwayCommons/netpath_pc/NetPath_11_corrected.owl")
correct_invalid_uris("BioPAX_Data/PathwayCommons/netpath_pc/NetPath_22.owl", "BioPAX_Data/PathwayCommons/netpath_pc/NetPath_22_corrected.owl")
correct_invalid_uris("BioPAX_Data/PathwayCommons/netpath_pc/NetPath_9.owl", "BioPAX_Data/PathwayCommons/netpath_pc/NetPath_9_corrected.owl")

 
BioPAX_Data/PantherBioPAX/BioPAX/Gonadotropin_releasing_hormone_receptor_pathway.owl
Invalid URIs found and corrected:
Line 19044: _[GnRH_GnRHR]_gnas_s808_csa54_ -> _%5BGnRH_GnRHR%5D_gnas_s808_csa54_ (issue: brackets)
 
BioPAX_Data/PathwayCommons/panther_pc/Gonadotropin_releasing_hormone_receptor_pathway.owl
Invalid URIs found and corrected:
Line 19037: _[GnRH_GnRHR]_gnas_s808_csa54_ -> _%5BGnRH_GnRHR%5D_gnas_s808_csa54_ (issue: brackets)
 
BioPAX_Data/PathwayCommons/pathbank_pc/PW002037.owl
Invalid URIs found and corrected:
Line 358: #Reference/KEGG_Compound_C05543	 -> #Reference/KEGG_Compound_C05543%09 (issue: whitespace or tab)
Line 569: Reference/KEGG_Compound_C05543	 -> Reference/KEGG_Compound_C05543%09 (issue: whitespace or tab)
 
BioPAX_Data/PathwayCommons/pathbank_pc/PW123534.owl
Invalid URIs found and corrected:
Line 330: #Reference/KEGG_Compound_C05543	 -> #Reference/KEGG_Compound_C05543%09 (issue: whitespace or tab)
Line 611: Reference/KEGG_Compound_C05543	 -> Reference/KE
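
The calls above repeat each path twice. One way to reduce the repetition is to derive the output name from the input name. This is only a sketch: `corrected_path` is a hypothetical helper, and the call to `correct_invalid_uris` is left as a comment so that the snippet runs on its own.

```python
from pathlib import Path

def corrected_path(owl_file):
    """Return the sibling path '<name>_corrected.owl' for an input OWL file."""
    path = Path(owl_file)
    return path.with_name(f"{path.stem}_corrected{path.suffix}")

files_to_fix = [
    "BioPAX_Data/PathwayCommons/pathbank_pc/PW002037.owl",
    "BioPAX_Data/PathwayCommons/netpath_pc/NetPath_7.owl",
]
for owl_file in files_to_fix:
    output_file = corrected_path(owl_file)
    print(f"{owl_file} -> {output_file}")
    # In the notebook, this is where the correction would run:
    # correct_invalid_uris(owl_file, str(output_file))
```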

#### Copy the KEGG PathwayCommons .translated files as .owl

In [11]:
# Define the folder containing the files
folder_path = "BioPAX_Data/PathwayCommons/kegg_pc/"

for file_name in os.listdir(folder_path):
    # Check if the file has the .translated extension
    if file_name.endswith(".translated"):
        # Create the new file name with the .owl extension
        base_name = os.path.splitext(file_name)[0]  # Remove the .translated part
        new_file_name = f"{base_name}.owl"
        
        # Define full paths for input and output
        old_file_path = os.path.join(folder_path, file_name)
        new_file_path = os.path.join(folder_path, new_file_name)
        
        # Copy the file to the new name with the .owl extension
        shutil.copy(old_file_path, new_file_path)
        print(f"Copied: {old_file_path} -> {new_file_path}")

print("All .translated files have been copied as .owl.")

Copied: BioPAX_Data/PathwayCommons/kegg_pc/hsa01100.translated -> BioPAX_Data/PathwayCommons/kegg_pc/hsa01100.owl
Copied: BioPAX_Data/PathwayCommons/kegg_pc/hsa00380.translated -> BioPAX_Data/PathwayCommons/kegg_pc/hsa00380.owl
Copied: BioPAX_Data/PathwayCommons/kegg_pc/hsa00980.translated -> BioPAX_Data/PathwayCommons/kegg_pc/hsa00980.owl
Copied: BioPAX_Data/PathwayCommons/kegg_pc/hsa00360.translated -> BioPAX_Data/PathwayCommons/kegg_pc/hsa00360.owl
Copied: BioPAX_Data/PathwayCommons/kegg_pc/hsa00533.translated -> BioPAX_Data/PathwayCommons/kegg_pc/hsa00533.owl
Copied: BioPAX_Data/PathwayCommons/kegg_pc/hsa00630.translated -> BioPAX_Data/PathwayCommons/kegg_pc/hsa00630.owl
Copied: BioPAX_Data/PathwayCommons/kegg_pc/hsa00512.translated -> BioPAX_Data/PathwayCommons/kegg_pc/hsa00512.owl
Copied: BioPAX_Data/PathwayCommons/kegg_pc/hsa00780.translated -> BioPAX_Data/PathwayCommons/kegg_pc/hsa00780.owl
Copied: BioPAX_Data/PathwayCommons/kegg_pc/hsa00053.translated -> BioPAX_Data/PathwayCom
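
Copying a file under a new extension does not change its content, so the step above only works if the `.translated` files already contain RDF/XML. That is an assumption here, and it can be checked cheaply by reading just the first XML element. In the sketch below, `looks_like_rdf_xml` is a hypothetical helper built on the standard library only.

```python
import xml.etree.ElementTree as ET

# Fully qualified tag name of the rdf:RDF root element.
RDF_ROOT = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}RDF"

def looks_like_rdf_xml(path):
    """Return True if the first XML element of the file is rdf:RDF."""
    try:
        with open(path, "rb") as handle:
            # Stop at the first "start" event: only the root element is inspected.
            for _event, element in ET.iterparse(handle, events=("start",)):
                return element.tag == RDF_ROOT
    except ET.ParseError:
        return False
    return False
```

A file for which this returns `False` would probably need a real format conversion rather than a rename.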