# Entity and Relationship Extraction with OpenAI and Neo4j

This code dynamically extracts entities and relationships from text (abstracts), normalizes and deduplicates them, and then stores them in a Neo4j graph database. The code uses OpenAI for extraction, fuzzy matching for normalization, and Neo4j for graph storage.



## Installations

In [1]:
pip install openai==0.28

Note: you may need to restart the kernel to use updated packages.


In [2]:
pip install fuzzywuzzy neo4j pandas





## Imports

In [3]:
import openai
import json
from neo4j import GraphDatabase
import pandas as pd
from collections import defaultdict
from fuzzywuzzy import process



## Extracting Entities and Relationships

Dynamic Sets: Two module-level sets, known_entities and known_relationships, accumulate the entity names and relationship types seen so far. They persist across abstracts, so names extracted later are normalized against names learned earlier.

Multiple Runs with Temperature Variation: The function calls OpenAI num_runs times on the same abstract, raising the temperature linearly from 0.3 on the first run to 0.7 on the last. Higher temperatures produce more varied responses, and combining the runs can recover entities and relationships that a single run misses.

Combining and Normalizing: The results of all runs are combined, then normalized and deduplicated so that near-identical names collapse into a single entry.
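The temperature spacing can be looked at in isolation. The sketch below is illustrative only: temperature_schedule is a hypothetical helper, and the 0.3 and 0.7 bounds are taken from the extraction function in the next cell. It also guards the single-run case, where dividing by num_runs - 1 would otherwise fail.

```python
def temperature_schedule(num_runs, low=0.3, high=0.7):
    """Return num_runs temperatures spaced evenly from low to high."""
    if num_runs < 1:
        raise ValueError("num_runs must be at least 1")
    if num_runs == 1:
        # A single run has no range to spread over, so use the lower bound
        return [low]
    step = (high - low) / (num_runs - 1)
    # Round to two decimals to hide floating-point noise
    return [round(low + step * i, 2) for i in range(num_runs)]


print(temperature_schedule(3))  # [0.3, 0.5, 0.7]
print(temperature_schedule(1))  # [0.3]
```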

In [4]:
# Set your OpenAI API key
openai.api_key = ""  # Add your key here

# Initialize dynamic sets for known entities and relationships
known_entities = set()
known_relationships = set()

# Function to extract entities and relationships from an abstract using OpenAI with temperature variation
def extract_entities_relationships_multiple_runs(abstract, num_runs=3):
    combined_entities = []
    combined_relationships = []

    for i in range(num_runs):
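        # Spread temperatures linearly from 0.3 to 0.7; a single run uses 0.3
        # (guards against division by zero when num_runs == 1)
        temperature = 0.3 + 0.4 * (i / (num_runs - 1)) if num_runs > 1 else 0.3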
        
        prompt = f"""
        Extract the entities and relationships from the following abstract:
        {abstract}

        Provide the output as a JSON in this format:
        {{
          "entities": [
            {{"id": "Entity1", "type": "Type1"}},
            {{"id": "Entity2", "type": "Type2"}}
          ],
          "relationships": [
            {{"source": "Entity1", "target": "Entity2", "relation": "RELATION_TYPE"}}
          ]
        }}
        Ensure the output uses double quotes for property names and values.
        """
        
        # Make API call to OpenAI
        response = openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": prompt}],
            temperature=temperature,
            max_tokens=1500
        )
        
        # Parse the response
        result = response['choices'][0]['message']['content']
        try:
            parsed_result = json.loads(result)
            entities = parsed_result.get('entities', [])
            relationships = parsed_result.get('relationships', [])
            
            combined_entities.extend(entities)
            combined_relationships.extend(relationships)
        
        except json.JSONDecodeError:
            print(f"Error parsing JSON for temperature {temperature}: {result}")
            continue

    # Deduplicate and normalize entities and relationships
    combined_entities = normalize_and_deduplicate_entities(combined_entities)
    combined_relationships = normalize_and_deduplicate_relationships(combined_relationships)
    
    return combined_entities, combined_relationships
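
Chat models sometimes wrap the JSON in a Markdown code fence or add a sentence around it, which makes json.loads fail on the raw reply. Below is a minimal sketch of a more tolerant parser, assuming the reply contains exactly one JSON object. parse_model_json is a hypothetical helper, not part of the OpenAI library.

```python
import json


def parse_model_json(text):
    """Parse the outermost {...} span of a model reply as JSON."""
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in model reply")
    # Anything before the first "{" or after the last "}" is ignored,
    # which covers code fences and short surrounding sentences
    return json.loads(text[start:end + 1])


fence = "`" * 3  # a Markdown code fence, as a model might emit it
reply = fence + 'json\n{"entities": [{"id": "Taiwan", "type": "Location"}], "relationships": []}\n' + fence
parsed = parse_model_json(reply)
print(parsed["entities"][0]["id"])  # Taiwan
```

In the function above, this could replace the direct json.loads(result) call. The except clause would then need to catch ValueError, which also covers json.JSONDecodeError.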


## Normalization and Deduplication 

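Entities: Entity names are normalized with fuzzy matching so that near-identical names (such as "cancer" and "Cancer") resolve to the same entity. Duplicate entities are then removed.

Relationships: Relationship types are normalized the same way, so spelling variants (such as "associated with" and "associated_with") collapse into one type, and duplicate (source, target, relation) triples are removed. Fuzzy matching compares characters, not meaning, so synonyms such as "associated with" and "related to" remain distinct.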

In [5]:
# Function to deduplicate and normalize entities
def normalize_and_deduplicate_entities(entities):
    global known_entities  # Use the dynamic known_entities set
    seen = set()
    unique_entities = []
    
    for entity in entities:
        normalized_entity = normalize_entity_name(entity['id'])
        if normalized_entity not in seen:
            seen.add(normalized_entity)
            entity['id'] = normalized_entity  # Update the entity ID to the normalized one
            unique_entities.append(entity)
            # Update the known_entities set dynamically
            known_entities.add(normalized_entity)
    
    return unique_entities

# Function to normalize and deduplicate relationships
def normalize_and_deduplicate_relationships(relationships):
    global known_relationships  # Use the dynamic known_relationships set
    seen = set()
    unique_relationships = []
    
    for relationship in relationships:
        source = normalize_entity_name(relationship['source'])
        target = normalize_entity_name(relationship['target'])
        relation = normalize_relationship_type(relationship['relation'])
        
        rel_tuple = (source, target, relation)
        if rel_tuple not in seen:
            seen.add(rel_tuple)
            unique_relationships.append({
                "source": source,
                "target": target,
                "relation": relation
            })
            # Update the known_relationships set dynamically
            known_relationships.add(relation)
    
    return unique_relationships


## Fuzzy Matching for Normalization

Entity and Relationship Normalization: These functions use fuzzy matching to compare each extracted entity name or relationship type with the known ones. If the best match scores above 80 on fuzzywuzzy's 0 to 100 similarity scale, the extracted value is replaced with the known value; otherwise it is kept unchanged.
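
To get a feel for what an 80-point cutoff merges, the sketch below scores a few pairs. It uses the standard library's difflib as a stand-in for fuzzywuzzy, so the exact scores may differ from fuzz.ratio. The similarity helper and the sample pairs are assumptions made for illustration.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive similarity on a 0-100 scale (difflib stand-in)."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


THRESHOLD = 80  # same cutoff as the normalization functions in this section

pairs = [
    ("Angiogenesis inhibitor", "angiogenesis inhibitors"),  # singular vs plural
    ("associated with", "associated_with"),                 # spelling variant
    ("associated with", "related to"),                      # synonym, different characters
]
for a, b in pairs:
    score = similarity(a, b)
    verdict = "merged" if score > THRESHOLD else "kept separate"
    print(f"{a!r} vs {b!r}: {score} -> {verdict}")
```

The first two pairs score above the cutoff and would be merged. The synonym pair scores well below it, which is why character-level matching cannot unify relationship types that differ in wording.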

In [6]:
# Function to normalize entity names using fuzzy matching
def normalize_entity_name(entity_name):
    global known_entities  # Use the dynamic set of known entities
    if len(known_entities) == 0:
        # If the known_entities set is empty, return the entity name as is
        return entity_name

    # Fuzzy matching against known entities
    best_match = process.extractOne(entity_name.lower(), known_entities, scorer=process.fuzz.ratio)
    if best_match and best_match[1] > 80:  # Threshold for similarity
        return best_match[0]
    
    return entity_name

# Function to normalize relationship types using fuzzy matching
def normalize_relationship_type(relation):
    global known_relationships  # Use the dynamic set of known relationships
    if len(known_relationships) == 0:
        # If the known_relationships set is empty, return the relation as is
        return relation

    # Fuzzy matching against known relationships
    best_match = process.extractOne(relation.lower(), known_relationships, scorer=process.fuzz.ratio)
    if best_match and best_match[1] > 80:  # Threshold for similarity
        return best_match[0]
    
    return relation

## Neo4j Connection and Data Insertion

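Inserting Data into Neo4j: This function writes the normalized and deduplicated entities and relationships to the Neo4j graph database with the MERGE clause, which creates a node or relationship only if no matching one exists. Because the entity MERGE matches on both id and type, the same id extracted with two different types produces two nodes.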

In [7]:
# Function to insert entities and relationships into Neo4j
def insert_into_neo4j(entities, relationships):
    with driver.session() as session:
        # Insert entities; values are passed as query parameters,
        # so quotes in the text need no manual escaping
        for entity in entities:
            session.run(
                "MERGE (e:Entity {id: $id, type: $type})",
                {"id": entity['id'], "type": entity['type']},
            )

        # Insert relationships
        for relationship in relationships:
            # Relationship types cannot be passed as parameters in Cypher, so the
            # type is interpolated; backticks quote it and are stripped from the name
            relation = relationship['relation'].upper().replace(" ", "_").replace("`", "")
            query = f"""
            MATCH (source:Entity {{id: $source}}),
                  (target:Entity {{id: $target}})
            MERGE (source)-[:`{relation}`]->(target)
            """
            session.run(
                query,
                {"source": relationship['source'], "target": relationship['target']},
            )
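
In the function above, the relationship type is interpolated into the query text, so model output ends up inside the Cypher statement. One hedged option is to reduce the type to a plain identifier before it reaches the query. The sketch below assumes that upper-case letters, digits, and underscores are enough. to_relationship_type and the RELATED_TO fallback are hypothetical choices, not part of the original code or of the Neo4j driver.

```python
import re


def to_relationship_type(relation, default="RELATED_TO"):
    """Turn free text into an identifier made of A-Z, 0-9 and underscores."""
    # Replace every run of other characters with a single underscore
    cleaned = re.sub(r"[^A-Z0-9]+", "_", relation.upper()).strip("_")
    # Fall back to a placeholder when nothing usable remains
    if not cleaned:
        return default
    # Avoid identifiers that start with a digit
    if cleaned[0].isdigit():
        cleaned = "REL_" + cleaned
    return cleaned


print(to_relationship_type("associated with"))  # ASSOCIATED_WITH
print(to_relationship_type("is-a"))             # IS_A
```

Punctuation such as quotes, brackets, and backticks is dropped by the substitution, so the result can be placed in the MERGE pattern without further escaping.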


## Running the Extraction and Insertion

The example demonstrates how to extract entities and relationships from an abstract and insert them into Neo4j.

In [8]:
# Example usage: Run the function on an abstract
abstract = """research on the cardiovascular toxicity of angiogenesis inhibitors among patients with cancer in taiwan is lacking this observational study explored the risk of major adverse cardiovascular events maces associated with angiogenesis inhibitors in taiwan we conducted a nested casecontrol study using the tcr taiwan cancer registry linked with the taiwan national insurance claim database we matched every case with 4 controls using riskset sampling by index date age sex cancer type and cancer diagnosis date conditional logistic regression was used to evaluate the risks of maces and different cardiovascular events using propensity score adjustment or matching sensitivity analyses were used to evaluate the risks matched by cancer stages or exposure within 1 year among a cohort of 284 292 after the exclusion of prevalent cases the incidences of maces among the overall cohort and those exposed to angiogenesis inhibitors were 225 and 325 events per 1000 personyears respectively we matched 17 817 cases with 70 740 controls with a mean age of 749 years and 568 of patients were men after propensity score adjustment angiogenesis inhibitors were associated with increased risks of maces odds ratio 456 95 ci 1781159 significantly increased risks were noted for heart failure hospitalization myocardial infarction cerebrovascular accident and venous thromboembolism but not for newonset atrial fibrillation similar results were observed after matching by cancer stage or restriction of 1year exposure angiogenesis inhibitors were associated with increased risks of maces among patients with various malignancies in taiwan but were not associated with newonset atrial fibrillation
"""
entities, relationships = extract_entities_relationships_multiple_runs(abstract, num_runs=3)

# Print the final entities and relationships
print("Entities:", entities)
print("Relationships:", relationships)

# Connect to Neo4j
uri = "bolt://localhost:7999"  # Adjust for your Neo4j instance (the default Bolt port is 7687)
username = "neo4j"
password = "password"
driver = GraphDatabase.driver(uri, auth=(username, password))

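# Function to escape special characters for Cypher string literals
# (Cypher escapes quotes with a backslash, not by doubling them;
# only needed when values are interpolated into the query text)
def escape_special_chars(value):
    return value.replace("\\", "\\\\").replace("'", "\\'")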

# Insert deduplicated entities and relationships into Neo4j
insert_into_neo4j(entities, relationships)

# Close Neo4j connection
driver.close()


Entities: [{'id': 'Cardiovascular toxicity', 'type': 'Medical condition'}, {'id': 'Angiogenesis inhibitors', 'type': 'Medication'}, {'id': 'Patients with cancer', 'type': 'Patient group'}, {'id': 'Taiwan', 'type': 'Location'}, {'id': 'Observational study', 'type': 'Study type'}, {'id': 'Major adverse cardiovascular events (MACEs)', 'type': 'Medical condition'}, {'id': 'Taiwan Cancer Registry', 'type': 'Database'}, {'id': 'Taiwan National Insurance Claim Database', 'type': 'Database'}, {'id': 'Case', 'type': 'Study group'}, {'id': 'Control', 'type': 'Study group'}, {'id': 'Age', 'type': 'Demographic factor'}, {'id': 'Sex', 'type': 'Demographic factor'}, {'id': 'Cancer type', 'type': 'Medical condition'}, {'id': 'Cancer diagnosis date', 'type': 'Date'}, {'id': 'Logistic regression', 'type': 'Statistical analysis method'}, {'id': 'Propensity score adjustment', 'type': 'Statistical analysis method'}, {'id': 'Sensitivity analyses', 'type': 'Statistical analysis method'}, {'id': 'Incidence',













