# Introduction
This notebook demonstrates the process of extracting knowledge from biomedical abstracts, constructing a knowledge graph, and querying the graph to answer domain-specific questions. It integrates Large Language Models (LLMs) and a graph database (Neo4j) to extract entities, relationships, and insights from biomedical data.


# Installing Required Libraries
This section installs the necessary libraries for the project:
- **fuzzywuzzy**: Compares text for similarity, helpful for matching terms.
- **neo4j**: Connects and interacts with the Neo4j graph database.
- **openai**: Interfaces with OpenAI's language models.

Run the command below to install these libraries if they're not already installed.


In [None]:
!pip install fuzzywuzzy neo4j openai==0.28



# Imports
This section imports the libraries used throughout the notebook, sets the OpenAI API key, and loads the abstracts CSV:
- **openai**: Connects to OpenAI's language model service.
- **pandas**: Manages large datasets in tabular format.
- **fuzzywuzzy**: Compares text for similarities, useful for matching terms.
- **neo4j**: Interfaces with the Neo4j graph database.
- **json**: Parses the JSON text returned by the language model into Python objects.


In [None]:
# Import necessary libraries
import openai
import pandas as pd
from fuzzywuzzy import process
from neo4j import GraphDatabase
import json

# Set your OpenAI API key
openai.api_key = ""

# Load the abstracts CSV
abstracts_df = pd.read_csv("/content/Test_abstracts.csv")

# Defining the Entity and Relationship Schema
The schema consists of entity types and relationship types, which correspond to the columns found in the ontology CSV file.
1. **Entity Types**: These are categories or types of nodes in the knowledge graph (e.g., drugs, diseases).
2. **Relationship Types**: These represent the connections between entities (e.g., "treated with drug", "adverse event occurs in").

The ontology schema aligns the data with predefined standards to ensure consistency and relevance.


In [None]:
# Sample ontology terms (adjust to your project’s terms)
entity_types = ["class_id", "preferred_label", "synonyms", "semantic_types", "mesh_cui", "rxnorm_cui"]
relationship_types = [
    "adjacent_to",
    "treated with drug",
    "adverse event occurs in",
    "adverse event outcome",
    "anterior_to",
    "anteriorly connected to",
    "bearer of",
    "bounding layer of",
    "composed primarily of",
    "connected to",
    "develops_from",
    "develops_into",
    "drug AE occurs in",
    "drug associated with AE",
    "drug associated with AE in adolescent",
    "drug associated with AE in adult",
    "drug associated with AE in newborn",
    "drug associated with AE in pediatric",
    "drug associated with AE in senior",
    "example of usage",
    "has disease",
    "has part",
    "induced_by",
    "induces",
    "is count of",
    "is evidence of",
    "located_in",
    "location_of",
    "may_diagnose",
    "may_prevent",
    "may_treat",
    "occurs in",
    "occurs in adult having disease",
    "occurs in patient having disease",
    "occurs in patient treated with drug",
    "overlaps",
    "part of",
    "produced_by",
    "produces",
    "protects",
    "realized in",
    "realizes",
    "source",
    "starts",
    "surrounded_by",
    "surrounds",
]




# Fuzzy function
This function maps a term to the closest entry in a list of predefined ontology terms (such as `relationship_types`), even when the spelling differs slightly. The best candidate is accepted only if its similarity score (0 to 100) reaches the threshold, which defaults to 80; otherwise the original term is returned as-is.

In [None]:
# Fuzzy matching function to map terms to ontology with a threshold
def map_to_ontology(term, ontology_terms, threshold=80):
    match, score = process.extractOne(term, ontology_terms)
    return match if score >= threshold else term
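
To illustrate the threshold behaviour without installing anything, the sketch below swaps in Python's standard-library `difflib.SequenceMatcher` for fuzzywuzzy. Its scores are not identical to fuzzywuzzy's, and the helper name `map_to_ontology_difflib` and the sample terms are assumptions made for this illustration only.

```python
from difflib import SequenceMatcher

def map_to_ontology_difflib(term, ontology_terms, threshold=80):
    """Return the closest ontology term if its similarity (0-100) meets the threshold."""
    if not ontology_terms:
        return term  # Nothing to match against
    best_match, best_score = term, -1.0
    for candidate in ontology_terms:
        # ratio() is in [0, 1]; scale it to the 0-100 range used by the threshold
        score = 100 * SequenceMatcher(None, term.lower(), candidate.lower()).ratio()
        if score > best_score:
            best_match, best_score = candidate, score
    return best_match if best_score >= threshold else term

sample_terms = ["may_treat", "may_prevent", "part of"]  # illustrative subset
print(map_to_ontology_difflib("may treat", sample_terms))  # close enough: mapped to "may_treat"
print(map_to_ontology_difflib("inhibits", sample_terms))   # no close match: returned unchanged
```

A term that falls below the threshold passes through untouched, which is how free-text relation labels created by the model can end up in the graph alongside ontology terms.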

# LLM Extraction
This section uses OpenAI's language model to extract entities (e.g., drugs, diseases) and relationships (e.g., "causes", "treats") from biomedical abstracts. The prompt combines the schema defined above with the text of one abstract and asks the model to return the terms it finds, using the schema where it fits.

**Steps:**
1. **Define a Prompt**: The prompt instructs the language model on how to process the abstract and extract structured information.
2. **Generate Output**: The model outputs entities and relationships in a JSON format for easy integration with the knowledge graph.


In [None]:
# Function to extract entities and relationships using OpenAI
def extract_entities_relationships(abstract, pubmed_id):
    prompt = f"""
    Based on the provided ontology terms, please extract entities and relationships from the following abstract.
    Use ontology relationship types only if they are contextually relevant; otherwise, create a descriptive label.

    Ontology Entities (Types): {entity_types}
    Ontology Relationships (Use only if contextually appropriate): {relationship_types}

    Abstract:
    {abstract}

    Format your response as JSON:
    {{
      "entities": [
        {{"id": "Entity1", "type": "Type1", "PubMedID": "{pubmed_id}"}},
        {{"id": "Entity2", "type": "Type2", "PubMedID": "{pubmed_id}"}}
      ],
      "relationships": [
        {{"source": "Entity1", "target": "Entity2", "relation": "RELATION_TYPE", "PubMedID": "{pubmed_id}"}}
      ]
    }}

    Ensure that relationships align with the context and avoid forcing any relationships to match a provided type.
    """

    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.5,
        max_tokens=1500
    )

    result = response['choices'][0]['message']['content']
    parsed_result = json.loads(result)
    entities = parsed_result.get('entities', [])
    relationships = parsed_result.get('relationships', [])

    return entities, relationships
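
Chat models sometimes wrap their JSON in a Markdown code fence or add a sentence around it, and in that case `json.loads` on the raw reply raises `JSONDecodeError`. The sketch below shows one possible way to make the parsing step more tolerant. The helper name `parse_model_json` is an assumption and not part of the notebook, and the approach cannot rescue replies that are truncated or otherwise invalid.

```python
import json

def parse_model_json(reply):
    """Best-effort parse of a JSON object embedded in a model reply (hypothetical helper)."""
    try:
        return json.loads(reply)
    except json.JSONDecodeError:
        # Fall back to the outermost {...} span in case other text surrounds the JSON
        start, end = reply.find("{"), reply.rfind("}")
        if start == -1 or end <= start:
            raise  # No JSON object found: re-raise the original error
        return json.loads(reply[start:end + 1])

reply = 'Here is the JSON:\n{"entities": [], "relationships": []}\nLet me know if you need more.'
parsed = parse_model_json(reply)
print(parsed["entities"], parsed["relationships"])
```

Inside `extract_entities_relationships`, such a helper could take the place of the direct `json.loads(result)` call; replies that still fail would continue to be caught and logged by the processing loop.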


# Normalization
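This step maps the entity types and relationship labels returned by the model onto the predefined ontology terms, using the fuzzy-matching function above. Entity IDs are copied through unchanged, and both the original and the normalized values are kept in each record so they can be compared later.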

In [None]:
# Normalize entities and relationships
def validate_entities_and_relationships(entities, relationships):
    normalized_entities = []
    normalized_relationships = []

    for entity in entities:
        original_type = entity['type']
        normalized_type = map_to_ontology(entity['type'], entity_types)
        entity['type'] = normalized_type
        normalized_entities.append({
            'original_id': entity['id'],
            'original_type': original_type,
            'normalized_id': entity['id'],
            'normalized_type': normalized_type,
            'PubMedID': entity.get('PubMedID')
        })

    for relationship in relationships:
        original_relation = relationship['relation']
        normalized_relation = map_to_ontology(relationship['relation'], relationship_types)
        relationship['relation'] = normalized_relation
        normalized_relationships.append({
            'source': relationship['source'],
            'target': relationship['target'],
            'original_relation': original_relation,
            'normalized_relation': normalized_relation,
            'PubMedID': relationship.get('PubMedID')
        })

    return normalized_entities, normalized_relationships
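
Because each record keeps the original and the normalized label side by side, it is easy to check what the normalization actually changed. The following is a minimal sketch of such a check; the helper `summarize_relation_changes` and the sample records (with made-up names and a placeholder PubMedID) are hypothetical and only mirror the shape of the records built above.

```python
from collections import Counter

def summarize_relation_changes(normalized_relationships):
    """Count (original, normalized) label pairs where normalization changed the label."""
    changes = Counter()
    for rel in normalized_relationships:
        if rel["original_relation"] != rel["normalized_relation"]:
            changes[(rel["original_relation"], rel["normalized_relation"])] += 1
    return changes

# Hypothetical records in the shape produced by validate_entities_and_relationships
sample = [
    {"source": "drug_a", "target": "disease_b", "original_relation": "may treat",
     "normalized_relation": "may_treat", "PubMedID": "0"},
    {"source": "drug_a", "target": "organ_c", "original_relation": "irritates",
     "normalized_relation": "irritates", "PubMedID": "0"},
]
for (before, after), count in summarize_relation_changes(sample).items():
    print(f"{before!r} -> {after!r}: {count}")
```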


# Output CSV files
This cell calls the functions defined above on every abstract and saves the results to CSV files for review. The normalized entity and relationship files keep the original values alongside the normalized ones; the lines that would write the raw extraction output are commented out. Abstracts that cannot be processed are recorded in a text log file, `error_log_validation.txt`.

In [None]:
import json

all_raw_entities = []
all_raw_relationships = []
all_normalized_entities = []
all_normalized_relationships = []

# Open a log file to track errors
error_log = open("error_log_validation.txt", "w")

for index, row in abstracts_df.iterrows():
    pubmed_id = row['PubMedID']
    abstract = row['Abstract']
    try:
        # Extract entities and relationships
        entities, relationships = extract_entities_relationships(abstract, pubmed_id)

        # Save raw entities and relationships
        all_raw_entities.extend(entities)
        all_raw_relationships.extend(relationships)

        # Normalize entities and relationships
        normalized_entities, normalized_relationships = validate_entities_and_relationships(entities, relationships)

        # Save normalized entities and relationships
        all_normalized_entities.extend(normalized_entities)
        all_normalized_relationships.extend(normalized_relationships)

    except json.JSONDecodeError as e:
        # Log JSON decode errors
        error_log.write(f"JSONDecodeError for PubMedID {pubmed_id}: {e}\n")
        error_log.write(f"Abstract: {abstract}\n\n")
        print(f"Error processing PubMedID {pubmed_id}: JSONDecodeError. Skipping...")
        continue

    except Exception as e:
        # Log any other errors
        error_log.write(f"General Error for PubMedID {pubmed_id}: {e}\n")
        error_log.write(f"Abstract: {abstract}\n\n")
        print(f"Error processing PubMedID {pubmed_id}: {e}. Skipping...")
        continue

# Close the log file
error_log.close()

# Save processed data to CSV for review
#pd.DataFrame(all_raw_entities).to_csv("raw_entities3.csv", index=False)
#pd.DataFrame(all_raw_relationships).to_csv("raw_relationships3.csv", index=False)
pd.DataFrame(all_normalized_entities).to_csv("normalized_entities_validation.csv", index=False)
pd.DataFrame(all_normalized_relationships).to_csv("normalized_relationships_validation.csv", index=False)

print("Processing complete. Check 'error_log_validation.txt' for any skipped abstracts.")


Error processing PubMedID 38906514: JSONDecodeError. Skipping...
Error processing PubMedID 38759667: JSONDecodeError. Skipping...
Error processing PubMedID 38651308: JSONDecodeError. Skipping...
Error processing PubMedID 38651235: JSONDecodeError. Skipping...
Error processing PubMedID 38573443: JSONDecodeError. Skipping...
Processing complete. Check 'error_log_validation.txt' for any skipped abstracts.
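
The loop above opens the log with `open()` and closes it explicitly at the end, so the file stays open if something other than the two handled exception types interrupts the run (for example a `KeyboardInterrupt`). A `with` block is one way to guarantee the file is closed. The sketch below shows the pattern in isolation; `process_abstracts`, `fake_extract`, and the demo log path are hypothetical stand-ins, not the notebook's own code.

```python
import json
import os
import tempfile

def process_abstracts(rows, extract, log_path):
    """Run `extract` on each (pubmed_id, abstract) pair and log failures to log_path."""
    results, skipped = [], []
    # The with-block closes the log even if an unexpected error escapes the loop
    with open(log_path, "w", encoding="utf-8") as error_log:
        for pubmed_id, abstract in rows:
            try:
                results.append(extract(abstract, pubmed_id))
            except json.JSONDecodeError as exc:
                error_log.write(f"JSONDecodeError for PubMedID {pubmed_id}: {exc}\n")
                skipped.append(pubmed_id)
            except Exception as exc:
                error_log.write(f"General Error for PubMedID {pubmed_id}: {exc}\n")
                skipped.append(pubmed_id)
    return results, skipped

def fake_extract(abstract, pubmed_id):
    # Stand-in for extract_entities_relationships: parses the "abstract" as JSON
    return json.loads(abstract)

log_path = os.path.join(tempfile.gettempdir(), "error_log_demo.txt")
rows = [("1", '{"entities": []}'), ("2", "not json")]
results, skipped = process_abstracts(rows, fake_extract, log_path)
print(len(results), skipped)
```

Returning the list of skipped IDs also makes it simple to retry only the abstracts that failed.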



# Loading Data into Neo4j
This section defines the functions that insert the extracted entities and relationships into the Neo4j knowledge graph. It reads the two CSV files containing the normalized entities and relationships and loads them into Neo4j as nodes and edges. Entities that share the same id (name) are merged into a single node to avoid duplication, even when their types differ; the additional types are appended to the existing node.

**Steps:**
1. **Connect to Neo4j**: Establishes a connection to the graph database.
   - Neo4j Connection Details:
     - URI (`uri`): Specifies the Neo4j instance to connect to.
     - Credentials (`username`, `password`): Authentication for accessing the database.
   - Driver Initialization: `GraphDatabase.driver` establishes the connection to Neo4j using the URI and credentials.
2. **Sanitization Function**: Cleans and formats strings to make them compatible with Neo4j's requirements.
   - Input Validation: Checks that the input is a string. If not, it returns the value as-is.
   - String Cleaning: Replaces or removes problematic characters such as hyphens (-), slashes (/), quotes, parentheses, colons, periods, and percent signs. For relationships, replaces spaces with underscores to conform to Neo4j's naming conventions.
   - Remove Non-Alphanumeric Characters: Keeps only letters, numbers, and spaces; underscores are kept for relationships only.
   - Lowercase Conversion: Converts the sanitized string to lowercase for consistency.
3. **Insert Nodes (Entities)**: Inserts entities into Neo4j as nodes.
   - Iterate Through Rows: Loops over each row of the `entities_df` DataFrame.
   - Sanitize Input: Cleans `normalized_id` and `normalized_type` to ensure compatibility.
   - Cypher Query:
     - MERGE: Creates a node if it doesn't already exist.
     - ON CREATE SET: Sets attributes (`types`, `PubMedID`) when the node is created.
     - ON MATCH SET: Updates attributes if the node already exists:
       - Appends new types to `types` if not already included.
       - Appends new `PubMedID` values to ensure all related IDs are captured.
   - Execution: Executes the Cypher query for each entity.
4. **Insert Edges (Relationships)**: Inserts relationships between entities as edges.
   - Iterate Through Rows: Loops over each row of the `relationships_df` DataFrame.
   - Sanitize Input: Cleans `source`, `target`, and `normalized_relation`.
   - Cypher Query:
     - MATCH: Finds the source and target nodes in the graph.
     - MERGE: Creates a relationship if it doesn't already exist.
     - ON CREATE SET: Sets the `PubMedID` when the relationship is created.
     - ON MATCH SET: Appends additional `PubMedID` values to existing relationships.
   - Execution: Executes the Cypher query for each relationship.
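
The ON MATCH rules in the steps above amount to an append-if-absent update on a comma-separated string. As a plain-Python illustration of that rule (the helper `append_unique` and the sample values are hypothetical and not part of the notebook):

```python
def append_unique(existing, new_value, sep=", "):
    """Mimic the ON MATCH rule: append new_value to a separator-joined string unless present."""
    if new_value in existing.split(sep):
        return existing
    return existing + sep + new_value

types = "drug"                            # value set when the node is first created
types = append_unique(types, "chemical")  # a new type is appended
types = append_unique(types, "chemical")  # a repeated type is ignored
print(types)  # drug, chemical
```

The same rule is applied to `PubMedID`, so a node or edge seen in several abstracts accumulates every ID once.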


In [None]:
import os
import re

import pandas as pd
from neo4j import GraphDatabase

# Neo4j connection details
uri = "neo4j://10.136.16.40:7687"  # Change to your Neo4j instance's URI
username = "neo4j"                 # Neo4j username
# Read the password from an environment variable instead of hard-coding it.
# NEO4J_PASSWORD is a placeholder name; set it before running this cell.
password = os.environ.get("NEO4J_PASSWORD", "")

# Alternative settings for a local instance
#uri = "bolt://localhost:7999"
#username = "neo4j"
#password = "password"


# Connect to Neo4j
driver = GraphDatabase.driver(uri, auth=(username, password))

# Function to sanitize strings for Neo4j compatibility
def sanitize_neo4j_input(value, is_relationship=False):
    """
    Cleans and formats strings for Neo4j. For relationships, underscores are
    preserved and spaces are replaced with underscores. For nodes, spaces are
    kept and underscores are removed.
    """
    if not isinstance(value, str):
        return value  # Return as-is if input is not a string

    sanitized_string = (
        value
        .replace("-", " ")  # Replace hyphens with spaces
        .replace("/", " ")  # Replace slashes with spaces
        .replace("'", "")   # Remove single quotes
        .replace('"', "")   # Remove double quotes
        .replace("(", "")   # Remove opening parentheses
        .replace(")", "")   # Remove closing parentheses
        .replace(":", " ")  # Replace colons with spaces
        .replace(".", " ")  # Replace periods with spaces
        .replace("%", " ")  # Replace percent signs with spaces
    )
    # For relationships, replace spaces with underscores (existing underscores are kept)
    if is_relationship:
        sanitized_string = sanitized_string.replace(" ", "_")

    # Remove any remaining characters other than letters, digits, and spaces
    # (underscores are also kept for relationships)
    pattern = r"[^a-zA-Z0-9_ ]" if is_relationship else r"[^a-zA-Z0-9 ]"
    sanitized_string = re.sub(pattern, "", sanitized_string).strip()

    return sanitized_string.lower()  # Convert to lowercase


# Function to insert entities into Neo4j with PubMedID
def insert_entities(entities_df):
    with driver.session() as session:
        for index, row in entities_df.iterrows():
            entity_id = sanitize_neo4j_input(row['normalized_id'])
            entity_type = sanitize_neo4j_input(row['normalized_type'])
            pubmed_id = row['PubMedID'] if 'PubMedID' in row else 'N/A'

            # Cypher query to merge entities based on id and append types and PubMedID if they differ
            query = f"""
            MERGE (e:Entity {{id: '{entity_id}'}})
            ON CREATE SET e.types = '{entity_type}', e.PubMedID = '{pubmed_id}'
            ON MATCH SET e.types = CASE
                WHEN '{entity_type}' IN split(e.types, ', ') THEN e.types
                ELSE e.types + ', ' + '{entity_type}'
            END,
            e.PubMedID = CASE
                WHEN '{pubmed_id}' IN split(e.PubMedID, ', ') THEN e.PubMedID
                ELSE e.PubMedID + ', ' + '{pubmed_id}'
            END
            """
            session.run(query)

# Function to insert relationships into Neo4j with PubMedID
def insert_relationships(relationships_df):
    with driver.session() as session:
        for index, row in relationships_df.iterrows():
            source = sanitize_neo4j_input(row['source'])
            target = sanitize_neo4j_input(row['target'])
            # Sanitize as a relationship type: spaces become underscores and
            # existing underscores (e.g. in "may_treat") are kept rather than stripped
            relation = sanitize_neo4j_input(row['normalized_relation'], is_relationship=True)
            pubmed_id = row['PubMedID'] if 'PubMedID' in row else 'N/A'

            # Cypher query to match the entities and create relationships with PubMedID attribute
            query = f"""
            MATCH (source:Entity {{id: '{source}'}}),
                  (target:Entity {{id: '{target}'}})
            MERGE (source)-[r:{relation}]->(target)
            ON CREATE SET r.PubMedID = '{pubmed_id}'
            ON MATCH SET r.PubMedID = CASE
                WHEN '{pubmed_id}' IN split(r.PubMedID, ', ') THEN r.PubMedID
                ELSE r.PubMedID + ', ' + '{pubmed_id}'
            END
            """
            session.run(query)

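# Load the CSV files into pandas DataFrames.
# header=0 assumes each file starts with a header row (as files written with to_csv in the
# previous section do) and replaces it with the names below, so the header is not inserted
# as data. Remove header=0 if your files contain data rows only.
entities_df = pd.read_csv('normalized_entities_final2_10.csv', header=0, names=["original_id","original_type","normalized_id","normalized_type","PubMedID"])
relationships_df = pd.read_csv('normalized_relationships_final2_10.csv', header=0, names=["source","target","original_relation","normalized_relation","PubMedID"])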

# Insert entities into Neo4j
insert_entities(entities_df)

# Insert relationships into Neo4j
insert_relationships(relationships_df)

# Close Neo4j connection
driver.close()

print("Data insertion complete!")

Data insertion complete!
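
The insertion functions above build each Cypher statement by interpolating values into the query text, which relies entirely on the sanitization step. A common alternative is to pass values as query parameters (`$name`), which the driver sends separately from the query text. The sketch below only builds the query strings and parameter dictionaries, so it runs without a database. The names `ENTITY_QUERY`, `relationship_query`, and `entity_params` are assumptions for illustration; because a relationship type cannot be supplied as an ordinary `$` parameter, it is validated and backtick-quoted instead.

```python
import re

# Same MERGE logic as insert_entities, with values supplied as parameters
ENTITY_QUERY = """
MERGE (e:Entity {id: $id})
ON CREATE SET e.types = $type, e.PubMedID = $pubmed_id
ON MATCH SET
    e.types = CASE WHEN $type IN split(e.types, ', ') THEN e.types
                   ELSE e.types + ', ' + $type END,
    e.PubMedID = CASE WHEN $pubmed_id IN split(e.PubMedID, ', ') THEN e.PubMedID
                      ELSE e.PubMedID + ', ' + $pubmed_id END
"""

def relationship_query(relation):
    """Build the relationship MERGE for one relationship type, rejecting unsafe names."""
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", relation):
        raise ValueError(f"Unsafe relationship type: {relation!r}")
    return (
        "MATCH (source:Entity {id: $source}), (target:Entity {id: $target}) "
        f"MERGE (source)-[r:`{relation}`]->(target) "
        "ON CREATE SET r.PubMedID = $pubmed_id "
        "ON MATCH SET r.PubMedID = CASE "
        "WHEN $pubmed_id IN split(r.PubMedID, ', ') THEN r.PubMedID "
        "ELSE r.PubMedID + ', ' + $pubmed_id END"
    )

def entity_params(row):
    """Collect the parameters for ENTITY_QUERY from one row of the entities table."""
    return {
        "id": row["normalized_id"],
        "type": row["normalized_type"],
        "pubmed_id": str(row["PubMedID"]),  # stored as a string so ', ' joins work
    }

# With an open session, the statements would be run along these lines:
#   session.run(ENTITY_QUERY, **entity_params(row))
#   session.run(relationship_query(relation), source=source, target=target, pubmed_id=pubmed_id)
print(relationship_query("may_treat"))
```

With parameters, quotes and other special characters in entity names no longer need to be stripped to keep the query valid, although the notebook's sanitization may still be wanted for consistent node ids.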
