# URI Definitions Generator

This notebook processes URIs from the GutBrainIE dataset to generate comprehensive definitions for Named Entity Linking (NEL). The workflow includes:

1. **Data Loading & Validation**: Load URIs from CSV and validate their formats
2. **Collection-based Definitions**: Extract definitions from existing annotation collections
3. **External Definitions**: Retrieve definitions from external sources via URI resolution  
4. **Definition Merging**: Combine definitions from both sources, removing duplicates
5. **Output Generation**: Create multiple output formats for downstream processing

The final outputs include:
- `merged_uri_definitions.json`: Combined definitions for each URI
- `split_uri_definitions.json`: Individual definitions with numeric IDs
- `id_to_uri.json`: Mapping from definition IDs back to URIs
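
To make the three formats concrete, the sketch below shows the shape each file is expected to have. The URIs and definition strings are made-up placeholders, not entries from the dataset.

```python
# Toy illustration of the three output shapes (placeholder URIs and texts).

# merged_uri_definitions.json: one list of definitions per URI
merged_uri_definitions = {
    "http://example.org/uri/A": ["placeholder definition one", "placeholder definition two"],
    "http://example.org/uri/B": ["placeholder definition three"],
}

# split_uri_definitions.json: one entry per individual definition, keyed by a numeric ID
split_uri_definitions = {
    0: "placeholder definition one",
    1: "placeholder definition two",
    2: "placeholder definition three",
}

# id_to_uri.json: the same IDs, pointing back to the URI each definition came from
id_to_uri = {
    0: "http://example.org/uri/A",
    1: "http://example.org/uri/A",
    2: "http://example.org/uri/B",
}

# Every definition ID should resolve to a URI that lists that definition.
for definition_id, text in split_uri_definitions.items():
    assert text in merged_uri_definitions[id_to_uri[definition_id]]
```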

# IMPORTANT
Before running, open `find_uri_definitions.py` and set the UMLS API key.

## Setup and Imports

Import required libraries and custom modules for URI definition processing.

In [None]:
# Standard library imports
import json
import os

# Third-party imports  
import pandas as pd
from tqdm import tqdm

# Custom module imports
from find_uri_definitions import find_definition, find_definition_from_collections

## Step 1: Data Loading and URI Validation

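Load the URIs from the CSV file and check their formats. Each URI is expected to contain one of the substrings `umls`, `purl`, `w3id`, or `mesh`, which identify the biomedical ontology sources used in the dataset. URIs that match none of them are reported as warnings; they are not removed.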

In [None]:
# Load URIs from the CSV file
uri_dataframe = pd.read_csv('../../Annotations/uris.csv', header=0)

# Validate URI formats - check for expected biomedical ontology domains
VALID_URI_DOMAINS = ['umls', 'purl', 'w3id', 'mesh']

print("Validating URI formats...")
for row_index, row in uri_dataframe.iterrows():
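    # Skip the row at index 0. Note: read_csv(header=0) has already consumed
    # the CSV header line, so this skips the first data row.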
    if row_index == 0:
        continue
    
    current_uri = row['uri']
    
    # Check if URI contains any of the valid domains
    is_valid_uri = any(domain in current_uri for domain in VALID_URI_DOMAINS)
    
    if not is_valid_uri:
        print(f'Warning: Row {row_index} has potentially invalid URI: {current_uri}')

Row 1155 has invalid URI: http://www.ebi.ac.uk/swo/SWO_0000243
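
The same substring check can be written as a small helper that returns the unexpected URIs instead of printing them, which makes the result easier to inspect or test. This is only a sketch: `find_unexpected_uris` is a hypothetical name, the first sample URI is a placeholder, and lowercasing the URI before matching is an added assumption (the cell above matches case-sensitively).

```python
VALID_URI_DOMAINS = ['umls', 'purl', 'w3id', 'mesh']

def find_unexpected_uris(uris, valid_domains=VALID_URI_DOMAINS):
    """Return (position, uri) pairs whose URI contains none of the expected substrings."""
    unexpected = []
    for position, uri in enumerate(uris):
        lowered = str(uri).lower()  # assumption: case-insensitive matching
        if not any(domain in lowered for domain in valid_domains):
            unexpected.append((position, uri))
    return unexpected

sample_uris = [
    "http://example.org/mesh/PLACEHOLDER",   # placeholder, contains 'mesh'
    "http://www.ebi.ac.uk/swo/SWO_0000243",  # the URI flagged in the output above
]
unexpected_uris = find_unexpected_uris(sample_uris)
print(unexpected_uris)
```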


## Step 2: Extract Definitions from Collections

Extract definitions for each URI from existing annotation collections (dev, train_platinum, train_gold). This provides context-specific definitions based on how entities are used in the dataset.

**Progress saving**: Results are saved every 20 iterations to prevent data loss in case of interruption.
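
If an interruption can happen while the checkpoint file itself is being written, the file may be left truncated. One possible safeguard, shown here as a sketch and not as what this notebook does, is to write to a temporary file and rename it over the target. `save_json_atomically` is a hypothetical helper name.

```python
import json
import os
import tempfile

def save_json_atomically(data, path):
    """Write JSON to a temp file in the target directory, then rename it over `path`."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, temp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as temp_file:
            json.dump(data, temp_file, indent=4, ensure_ascii=False)
        # The rename is atomic on POSIX when both paths are on the same filesystem.
        os.replace(temp_path, path)
    except BaseException:
        # Clean up the partial temp file before re-raising.
        if os.path.exists(temp_path):
            os.remove(temp_path)
        raise

# Demonstration in a throwaway directory with placeholder data.
with tempfile.TemporaryDirectory() as demo_dir:
    demo_path = os.path.join(demo_dir, "checkpoint.json")
    save_json_atomically({"http://example.org/uri/A": ["placeholder"]}, demo_path)
    with open(demo_path, encoding="utf-8") as checkpoint_file:
        restored = json.load(checkpoint_file)
print(restored)
```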

In [None]:
# Initialize storage for definitions extracted from collections
uri_to_collection_definitions = {}

# Define collection files to search for definitions
annotation_collection_files = [
    "../../Annotations/Dev/json_format/dev.json",
    "../../Annotations/Train/platinum_quality/json_format/train_platinum.json",
    "../../Annotations/Train/gold_quality/json_format/train_gold.json"
]

print(f"Extracting definitions from {len(annotation_collection_files)} collection files...")

# Process each URI to find definitions in collections
for row_index, row in tqdm(uri_dataframe.iterrows(), total=uri_dataframe.shape[0], desc="Processing URIs"):
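    # Skip the row at index 0. Note: read_csv(header=0) has already consumed
    # the CSV header line, so this skips the first data row.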
    if row_index == 0:
        continue
    
    current_uri = row['uri']
    
    try:
        # Find definitions for this URI in the collection files
        uri_to_collection_definitions[current_uri] = find_definition_from_collections(
            current_uri, annotation_collection_files
        )
    except Exception as error:
        print(f"Error processing {current_uri}: {error}")
        uri_to_collection_definitions[current_uri] = []
    
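    # Save progress every 20 iterations to prevent data loss
    if row_index % 20 == 0:
        with open('uri_collection_definitions.json', 'w', encoding='utf-8') as output_file:
            json.dump(uri_to_collection_definitions, output_file, indent=4, ensure_ascii=False)

# Final save so that URIs processed after the last checkpoint are written as well
with open('uri_collection_definitions.json', 'w', encoding='utf-8') as output_file:
    json.dump(uri_to_collection_definitions, output_file, indent=4, ensure_ascii=False)

print(f"Collection-based definitions extracted for {len(uri_to_collection_definitions)} URIs")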

## Step 3: Retrieve External Definitions

Retrieve definitions by resolving URIs directly from external sources (ontology websites, APIs, etc.). 

**Resumable processing**: This cell can be run multiple times. It loads previously computed definitions and only queries URIs that have no stored definition yet, including URIs stored with an empty list.

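**Special cases**: Some URIs require manual handling due to API limitations or specific formatting needs. In this cell, `http://www.ebi.ac.uk/swo/SWO_0000243` is assigned the hard-coded definition "Jaccard's index".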
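
The resume rule can be illustrated in isolation with a stub in place of the real lookup. In this sketch, `resume_definitions` and `fake_fetch` are hypothetical names and the `uri:*` keys are placeholders; the stub only records which URIs would have triggered an external request.

```python
def resume_definitions(uris, cached, fetch):
    """Reuse non-empty cached definitions; call `fetch` only for the remaining URIs."""
    results = {}
    for uri in uris:
        if cached.get(uri):  # non-empty list from a previous run
            results[uri] = cached[uri]
            continue
        try:
            results[uri] = fetch(uri)
        except Exception as error:
            print(f"Error retrieving definition for {uri}: {error}")
            results[uri] = []
    return results

calls = []

def fake_fetch(uri):
    """Stand-in for an external lookup; records the URIs it was asked for."""
    calls.append(uri)
    return [f"definition of {uri}"]

cached = {"uri:a": ["cached definition"], "uri:b": []}
results = resume_definitions(["uri:a", "uri:b", "uri:c"], cached, fake_fetch)
print(calls)  # only the URIs without a stored definition are fetched
```

Because an empty list counts as "missing", URIs whose lookup failed in an earlier run are retried on the next run.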

In [None]:
# Load previously computed definitions if available (enables resumable processing)
EXTERNAL_DEFINITIONS_FILE = 'uri_retrieved_definitions.json'

if os.path.exists(EXTERNAL_DEFINITIONS_FILE):
    print(f"Loading existing definitions from '{EXTERNAL_DEFINITIONS_FILE}'")
    with open(EXTERNAL_DEFINITIONS_FILE, 'r', encoding='utf-8') as input_file:
        previously_computed_definitions = json.load(input_file)
else:
    print("No existing definitions file found, starting from scratch")
    previously_computed_definitions = {}

# Initialize storage for external definitions    
uri_to_external_definitions = {}

print("Retrieving definitions from external sources...")

# Process each URI to get external definitions
for row_index, row in tqdm(uri_dataframe.iterrows(), total=uri_dataframe.shape[0], desc="Retrieving external definitions"):
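    # Skip the row at index 0. Note: read_csv(header=0) has already consumed
    # the CSV header line, so this skips the first data row.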
    if row_index == 0:
        continue
    
    current_uri = row['uri']
    
    # Skip if we already have definitions for this URI from previous runs
    if current_uri in previously_computed_definitions and len(previously_computed_definitions[current_uri]) > 0:
        uri_to_external_definitions[current_uri] = previously_computed_definitions[current_uri]
        continue
    
    # Handle special case URI that requires manual definition
    if current_uri == "http://www.ebi.ac.uk/swo/SWO_0000243":
        uri_to_external_definitions[current_uri] = ["Jaccard's index"]
    else:
        try:
            # Attempt to retrieve definition from external source
            uri_to_external_definitions[current_uri] = find_definition(current_uri)
        except Exception as error:
            print(f"Error retrieving definition for {current_uri}: {error}")
            uri_to_external_definitions[current_uri] = []
    
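    # Save progress every 20 iterations to prevent data loss
    if row_index % 20 == 0:
        with open(EXTERNAL_DEFINITIONS_FILE, 'w', encoding='utf-8') as output_file:
            json.dump(uri_to_external_definitions, output_file, indent=4, ensure_ascii=False)

# Final save so that URIs processed after the last checkpoint are written as well
with open(EXTERNAL_DEFINITIONS_FILE, 'w', encoding='utf-8') as output_file:
    json.dump(uri_to_external_definitions, output_file, indent=4, ensure_ascii=False)

print(f"External definitions retrieved for {len(uri_to_external_definitions)} URIs")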

Loading existing definitions from 'uri_definitions.json'


100%|██████████| 1761/1761 [00:02<00:00, 860.22it/s] 


## Step 4: Merge Definitions from All Sources  

Combine definitions from both collection-based and external sources. This step:
- Uses sets to automatically eliminate duplicates
- Normalizes text by converting to lowercase and stripping whitespace
- Provides statistics on coverage and empty definitions
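
A toy example shows how normalization and sets interact: two spellings of the same text collapse into one entry. The `uri:*` keys and definition strings are placeholders, and `merge_definitions` is a hypothetical helper that uses `setdefault`, so it also accepts URIs present in only one source.

```python
def merge_definitions(*sources):
    """Merge {uri: [definitions]} dicts into {uri: set of normalized definitions}."""
    merged = {}
    for source in sources:
        for uri, definitions in source.items():
            normalized = {definition.lower().strip() for definition in definitions}
            merged.setdefault(uri, set()).update(normalized)
    return merged

collection = {"uri:a": ["Gut Microbiota ", "microbiome"], "uri:b": []}
external = {"uri:a": ["gut microbiota"], "uri:c": ["Placeholder text"]}

merged = merge_definitions(collection, external)
empty_uris = [uri for uri, definitions in merged.items() if not definitions]

print(merged["uri:a"])  # "Gut Microbiota " and "gut microbiota" become one entry
print(empty_uris)
```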

In [None]:
# Load definitions from both sources
COLLECTION_DEFINITIONS_FILE = "uri_collection_definitions.json"
EXTERNAL_DEFINITIONS_FILE = "uri_retrieved_definitions.json"

print("Loading definitions from both sources...")

with open(COLLECTION_DEFINITIONS_FILE, 'r', encoding='utf-8') as input_file:
    collection_based_definitions = json.load(input_file)

with open(EXTERNAL_DEFINITIONS_FILE, 'r', encoding='utf-8') as input_file:
    external_definitions = json.load(input_file)

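# Initialize merged definitions with sets to automatically handle duplicates.
# Keys from both sources are included so that a URI missing from the external
# file does not raise a KeyError during merging.
merged_uri_definitions = {
    uri: set() for uri in [*external_definitions, *collection_based_definitions]
}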

print("Merging definitions from collection sources...")
# Add definitions from collection sources
for uri, definition_list in collection_based_definitions.items():
    if len(definition_list) > 0:
        # Normalize definitions: lowercase and strip whitespace
        normalized_definitions = [definition.lower().strip() for definition in definition_list]
        merged_uri_definitions[uri].update(normalized_definitions)

print("Merging definitions from external sources...")
# Add definitions from external sources  
for uri, definition_list in external_definitions.items():
    if len(definition_list) > 0:
        # Normalize definitions: lowercase and strip whitespace
        normalized_definitions = [definition.lower().strip() for definition in definition_list]
        merged_uri_definitions[uri].update(normalized_definitions)

# Calculate and display statistics
total_uris = len(merged_uri_definitions)
empty_definitions_count = sum(1 for definitions in merged_uri_definitions.values() if len(definitions) == 0)

print("\n=== Merging Statistics ===")
print(f"Total URIs processed: {total_uris}")
print(f"URIs with definitions: {total_uris - empty_definitions_count}")
print(f"URIs with empty definitions: {empty_definitions_count}")
# Guard against division by zero when no URIs were loaded
coverage_percent = (total_uris - empty_definitions_count) / total_uris * 100 if total_uris else 0.0
print(f"Coverage: {coverage_percent:.1f}%")

Number of URIs in merged definitions: 1760
Number of URIs with empty definitions in merged definitions: 0


## Step 5: Save Merged Definitions

Save the merged definitions to a JSON file, converting sets back to lists for JSON serialization.
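
The conversion is needed because Python's `json` module cannot serialize `set` objects. The short sketch below, using placeholder data, shows the error and one way around it; sorting the values is an optional choice that keeps the output order stable between runs.

```python
import json

definitions = {"uri:a": {"beta", "alpha"}}

try:
    json.dumps(definitions)
except TypeError as error:
    print(f"Sets cannot be serialized directly: {error}")

# Convert each set to a sorted list before serializing.
serializable = {uri: sorted(values) for uri, values in definitions.items()}
encoded = json.dumps(serializable)
print(encoded)
```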

In [None]:
# Save merged definitions to JSON file
MERGED_DEFINITIONS_FILE = 'merged_uri_definitions.json'

print(f"Saving merged definitions to {MERGED_DEFINITIONS_FILE}...")

# Convert sets to sorted lists for JSON serialization
# (sorting keeps the output order, and the IDs derived from it, reproducible across runs)
merged_definitions_for_json = {
    uri: sorted(definition_set)
    for uri, definition_set in merged_uri_definitions.items()
}

with open(MERGED_DEFINITIONS_FILE, 'w', encoding='utf-8') as output_file:
    json.dump(merged_definitions_for_json, output_file, indent=4, ensure_ascii=False)

print(f"Successfully saved {len(merged_definitions_for_json)} URI definitions")

## Step 6: Generate Indexed Definition Format

Create alternative output formats for downstream processing:
- **Split definitions**: Each individual definition gets a unique numeric ID
- **ID-to-URI mapping**: Maps definition IDs back to their source URIs

This format is useful for embedding generation and similarity search systems.
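
One detail matters when the two files are read back: JSON object keys are always strings, so an integer ID returned by a search index has to be converted before the lookup. The sketch below uses placeholder data; `uri_for_hit` and `best_hit_id` are hypothetical names standing in for whatever the downstream system provides.

```python
import json

split_definitions = {0: "placeholder definition one", 1: "placeholder definition two"}
id_to_uri = {0: "http://example.org/uri/A", 1: "http://example.org/uri/A"}

# Round-trip through JSON, as happens when the files are saved and reloaded.
split_loaded = json.loads(json.dumps(split_definitions))
id_to_uri_loaded = json.loads(json.dumps(id_to_uri))

def uri_for_hit(definition_id, mapping):
    """Map a numeric hit ID back to its URI; keys loaded from JSON are strings."""
    return mapping[str(definition_id)]

best_hit_id = 1  # hypothetical result of a similarity search
print(list(split_loaded.keys()))
print(uri_for_hit(best_hit_id, id_to_uri_loaded))
```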

In [None]:
# Load the merged definitions for processing
MERGED_DEFINITIONS_FILE = 'merged_uri_definitions.json'

print(f"Loading merged definitions from {MERGED_DEFINITIONS_FILE}...")

with open(MERGED_DEFINITIONS_FILE, 'r', encoding='utf-8') as input_file:
    final_uri_definitions = json.load(input_file)

print(f"Loaded definitions for {len(final_uri_definitions)} URIs")

In [None]:
# Create indexed format: assign unique IDs to individual definitions
indexed_definitions = {}  # definition_id -> definition_text
definition_id_to_uri = {}  # definition_id -> source_uri

current_definition_id = 0

print("Creating indexed definition format...")

# Process each URI and its definitions
for source_uri, definition_list in final_uri_definitions.items():
    for individual_definition in definition_list:
        # Assign unique ID to this definition
        indexed_definitions[current_definition_id] = individual_definition
        definition_id_to_uri[current_definition_id] = source_uri
        current_definition_id += 1

# Save the indexed format files
SPLIT_DEFINITIONS_FILE = "split_uri_definitions.json"
ID_TO_URI_FILE = "id_to_uri.json"

print(f"Saving {len(indexed_definitions)} individual definitions...")

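# ensure_ascii=False keeps non-ASCII characters readable, matching the other output files
with open(SPLIT_DEFINITIONS_FILE, "w", encoding="utf-8") as output_file:
    json.dump(indexed_definitions, output_file, indent=4, ensure_ascii=False)

with open(ID_TO_URI_FILE, "w", encoding="utf-8") as output_file:
    json.dump(definition_id_to_uri, output_file, indent=4, ensure_ascii=False)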

print(f"\n=== Final Output Summary ===")
print(f"Generated files:")
print(f"  - {MERGED_DEFINITIONS_FILE}: {len(final_uri_definitions)} URIs with definitions")
print(f"  - {SPLIT_DEFINITIONS_FILE}: {len(indexed_definitions)} individual definitions")
print(f"  - {ID_TO_URI_FILE}: {len(definition_id_to_uri)} ID-to-URI mappings")
print(f"\nProcessing complete! 🎉")