# Clinical Database Demo: Entity Extraction & Deduplication

This notebook demonstrates how to use **GLinker** to extract structured data from unstructured clinical notes and match entities against an existing database to avoid duplicate records.

## Use Case

Healthcare systems often need to:
1. Extract entities (Patients, Doctors, Diseases) from clinical notes
2. Check if entities already exist in the database
3. Only insert **new** records to avoid duplicates

## Key Features Demonstrated

- **Zero-shot NER** with custom labels (Patient, Doctor, Disease)
- **Name variation handling** through database aliases and L2/L3 matching
- **Entity linking** to match extracted text to database records
- **Deduplication logic** to skip existing records

## Setup

In [1]:
from glinker import ConfigBuilder, DAGExecutor, DAGPipeline



## Configure GLinker Pipeline

We'll set up a 4-layer pipeline:
- **L1**: Zero-shot entity extraction using GLiNER
- **L2**: Dictionary lookup for candidate generation
- **L3**: Entity linking/disambiguation
- **L0**: Aggregation of results from all layers

In [2]:
builder = ConfigBuilder(name="clinical_db_pipeline")
    
# Set schema template to use only labels (not descriptions) for L3 matching
builder.set_schema_template("{label}")

# L1: Zero-Shot NER
# Labels relevant to our DB schema; "Symptom" is also extracted, although the mock database holds no symptom records
builder.l1.gliner(
    model="knowledgator/gliner-bi-base-v2.0",
    labels=["Patient", "Doctor", "Disease", "Symptom"], 
    threshold=0.3
)

# L2: Dictionary Lookup (Candidate Generation)
builder.l2.add(
    "dict",
    priority=0,
    search_mode=["exact", "fuzzy"],
    fuzzy={"max_distance": 2, "min_similarity": 0.8}
)

# L3: Entity Linking (Disambiguation)
builder.l3.configure(
    model="knowledgator/gliner-bi-edge-v2.0", # Using fast edge model
    threshold=0.3,  # Lowered from 0.5 to improve linking
    device="cpu",
    max_length=512
)

# L0: Aggregation
builder.l0.configure(
    min_confidence=0.4, 
    include_unlinked=True # Critical: We need unlinked entities to detect "New" records
)

config = builder.get_config()
pipeline = DAGPipeline(**config)
executor = DAGExecutor(pipeline)
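
The `fuzzy` settings for L2 combine an edit-distance cap with a similarity floor. This notebook does not show how GLinker computes either value, so the sketch below assumes plain Levenshtein distance and a similarity of `1 - distance / max(len(a), len(b))`, purely to illustrate how a misspelling such as "Jon Doe" could pass both thresholds. The function names `levenshtein` and `passes_fuzzy` are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def passes_fuzzy(mention: str, candidate: str,
                 max_distance: int = 2, min_similarity: float = 0.8) -> bool:
    """Assumed rule: both thresholds must hold. Not confirmed against GLinker itself."""
    a, b = mention.lower(), candidate.lower()
    distance = levenshtein(a, b)
    longest = max(len(a), len(b)) or 1  # avoid division by zero for empty strings
    similarity = 1 - distance / longest
    return distance <= max_distance and similarity >= min_similarity


print(passes_fuzzy("Jon Doe", "John Doe"))     # True: distance 1, similarity 0.875
print(passes_fuzzy("Jane Smith", "John Doe"))  # False: distance exceeds 2
```

Under these assumptions a one-letter slip in an eight-character name clears the 0.8 floor, while a short name with the same slip might not, which is one reason aliases in the database remain useful.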





## Load Mock Database

Our mock database contains existing records:
- **Patients**: John Doe (P001), Sarah Connor (P002)
- **Doctors**: Dr. Gregory House (D001), Dr. Stephen Strange (D002)
- **Diseases**: Diabetes Mellitus (C001), Hypertension (C002)
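
The contents of `example_mock_db.jsonl` are not shown in this notebook. The sketch below writes a comparable file with one JSON object per line. The field names (`entity_id`, `label`, `entity_type`, `aliases`) are assumptions inferred from the attributes read later in the notebook; only the IDs, labels, and the aliases "Jon Doe", "John H Doe", and "Dr. House" come from the text here. The file name `mock_db_sketch.jsonl` is hypothetical and is written to the system temp directory. Check the GLinker documentation for the real schema before relying on this.

```python
import json
import tempfile
from pathlib import Path

# Field names are assumed; IDs, labels, and aliases are taken from this notebook.
records = [
    {"entity_id": "P001", "label": "John Doe", "entity_type": "Patient",
     "aliases": ["Jon Doe", "John H Doe"]},
    {"entity_id": "P002", "label": "Sarah Connor", "entity_type": "Patient",
     "aliases": []},
    {"entity_id": "D001", "label": "Dr. Gregory House", "entity_type": "Doctor",
     "aliases": ["Dr. House"]},
    {"entity_id": "D002", "label": "Dr. Stephen Strange", "entity_type": "Doctor",
     "aliases": []},
    {"entity_id": "C001", "label": "Diabetes Mellitus", "entity_type": "Disease",
     "aliases": []},
    {"entity_id": "C002", "label": "Hypertension", "entity_type": "Disease",
     "aliases": []},
]

path = Path(tempfile.gettempdir()) / "mock_db_sketch.jsonl"
with path.open("w", encoding="utf-8") as fh:
    for record in records:
        fh.write(json.dumps(record) + "\n")  # one JSON object per line

# Read the file back to confirm every line parses on its own.
with path.open(encoding="utf-8") as fh:
    loaded = [json.loads(line) for line in fh]
print(len(loaded), "records written to", path.name)
```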

In [3]:
MOCK_DB_PATH = "../data/example_mock_db.jsonl"
print(f"Loading existing records from {MOCK_DB_PATH}...")
executor.load_entities(MOCK_DB_PATH, target_layers=['dict'])
print("Database loaded.\n")

Loading existing records from ../data/example_mock_db.jsonl...
Database loaded.



## Process Clinical Notes

### Note 1: Existing Entities
This note contains entities that should match existing database records:
- "Dr. House" → Dr. Gregory House (D001) *(via alias)*
- "Jon Doe" → John Doe (P001) *(via alias)*
- "high blood pressure" → Hypertension (C002)

In [4]:
note_1 = "Dr. House checked patient Jon Doe who complained of high blood pressure."

print(f"📄 Note 1: \"{note_1}\"\n")

context = executor.execute({"texts": [note_1]})
results = context.data.get('l0_result')

if results and results.entities:
    entities = results.entities[0]
    
    print("   --- Database Action Log ---")
    for ent in entities:
        entity_text = ent.mention_text
        entity_type = ent.label
        
        if ent.is_linked:
            link = ent.linked_entity
            eid = link.entity_id
            print(f"   ✅ [EXISTING] Matched '{entity_text}' ({entity_type}) -> ID: {eid} ({link.label})")
            print(f"       -> Action: SKIP INSERTION (Record exists)")
        else:
            print(f"   🆕 [NEW RECORD] '{entity_text}' ({entity_type}) -> Insert into Database?")
            table = "DOCTORS" if entity_type == "Doctor" else "PATIENTS" if entity_type == "Patient" else "DISEASES"
            print(f"       -> Action: INSERT into {table} table")

📄 Note 1: "Dr. House checked patient Jon Doe who complained of high blood pressure."

   --- Database Action Log ---
   ✅ [EXISTING] Matched 'high blood pressure' (Disease) -> ID: C002 (Hypertension)
       -> Action: SKIP INSERTION (Record exists)
   ✅ [EXISTING] Matched 'Jon Doe' (Patient) -> ID: P001 (John Doe)
       -> Action: SKIP INSERTION (Record exists)
   ✅ [EXISTING] Matched 'Dr. House' (Doctor) -> ID: D001 (Dr. Gregory House)
       -> Action: SKIP INSERTION (Record exists)


### Note 2: New Entities
This note contains entities that don't exist in the database and should be inserted:
- "Dr. Meredith Grey" → NEW
- "Jane Smith" → NEW
- "Arrhythmia" → NEW

In [5]:
note_2 = "Referral: Dr. Meredith Grey examining new patient Jane Smith for possible Arrhythmia."

print(f"\n📄 Note 2: \"{note_2}\"\n")

context = executor.execute({"texts": [note_2]})
results = context.data.get('l0_result')

if results and results.entities:
    entities = results.entities[0]
    
    print("   --- Database Action Log ---")
    for ent in entities:
        entity_text = ent.mention_text
        entity_type = ent.label
        
        if ent.is_linked:
            link = ent.linked_entity
            eid = link.entity_id
            print(f"   ✅ [EXISTING] Matched '{entity_text}' ({entity_type}) -> ID: {eid} ({link.label})")
            print(f"       -> Action: SKIP INSERTION (Record exists)")
        else:
            print(f"   🆕 [NEW RECORD] '{entity_text}' ({entity_type}) -> Insert into Database?")
            table = "DOCTORS" if entity_type == "Doctor" else "PATIENTS" if entity_type == "Patient" else "DISEASES"
            print(f"       -> Action: INSERT into {table} table")


📄 Note 2: "Referral: Dr. Meredith Grey examining new patient Jane Smith for possible Arrhythmia."

   --- Database Action Log ---
   🆕 [NEW RECORD] 'Dr. Meredith Grey' (Doctor) -> Insert into Database?
       -> Action: INSERT into DOCTORS table
   🆕 [NEW RECORD] 'Jane Smith' (Patient) -> Insert into Database?
       -> Action: INSERT into PATIENTS table
   🆕 [NEW RECORD] 'Arrhythmia' (Disease) -> Insert into Database?
       -> Action: INSERT into DISEASES table
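
The chained conditional that picks a table sends every label other than `Doctor` or `Patient` to `DISEASES`, which would include the `Symptom` label configured for the L1 extractor. A minimal alternative, assuming you would rather route labels explicitly, is a lookup table. `TABLE_FOR_LABEL` and `table_for` are hypothetical names, and returning `None` for unmapped labels is one possible design choice rather than something the notebook prescribes.

```python
from typing import Optional

# Explicit label-to-table routing; labels without an entry get no table.
TABLE_FOR_LABEL = {
    "Doctor": "DOCTORS",
    "Patient": "PATIENTS",
    "Disease": "DISEASES",
}


def table_for(label: str) -> Optional[str]:
    """Return the target table, or None when the label has no table."""
    return TABLE_FOR_LABEL.get(label)


for label in ["Doctor", "Patient", "Disease", "Symptom"]:
    print(label, "->", table_for(label))
```

A caller can then decide what to do with `None`, for example log the mention for review instead of inserting it.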


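### Note 3: Name Variation
This note contains a name variant that should match an existing database record:
- "John H Doe" → John Doe (P001) *(via alias)*

In [6]: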
note_3 = "Patient John H Doe returned for follow-up appointment."

print(f"\n📄 Note 3: \"{note_3}\"\n")

context = executor.execute({"texts": [note_3]})
results = context.data.get('l0_result')

if results and results.entities:
    entities = results.entities[0]
    
    print("   --- Database Action Log ---")
    for ent in entities:
        entity_text = ent.mention_text
        entity_type = ent.label
        # Skip generic entity type names (e.g., "Patient", "Doctor", "Disease")
        if entity_text.lower() == entity_type.lower():
            continue
        
        if ent.is_linked:
            link = ent.linked_entity
            eid = link.entity_id
            print(f"   ✅ [EXISTING] Matched '{entity_text}' ({entity_type}) -> ID: {eid} ({link.label})")
            print(f"       -> Action: SKIP INSERTION (Record exists)")
        else:
            print(f"   🆕 [NEW RECORD] '{entity_text}' ({entity_type}) -> Insert into Database?")
            table = "DOCTORS" if entity_type == "Doctor" else "PATIENTS" if entity_type == "Patient" else "DISEASES"
            print(f"       -> Action: INSERT into {table} table")


📄 Note 3: "Patient John H Doe returned for follow-up appointment."

   --- Database Action Log ---
   ✅ [EXISTING] Matched 'John H Doe' (Patient) -> ID: P001 (John Doe)
       -> Action: SKIP INSERTION (Record exists)
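
The three note cells repeat the same loop, so it can be factored into a helper. The sketch below relies only on the attributes the notebook already reads (`mention_text`, `label`, `is_linked`, `linked_entity.entity_id`). `Link`, `Mention`, and `plan_actions` are hypothetical stand-ins so the example runs without a loaded pipeline; they are not GLinker classes.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class Link:
    """Stand-in for a linked database record."""
    entity_id: str
    label: str


@dataclass
class Mention:
    """Stand-in for one extracted entity."""
    mention_text: str
    label: str
    linked_entity: Optional[Link] = None

    @property
    def is_linked(self) -> bool:
        return self.linked_entity is not None


def plan_actions(entities: Iterable[Mention]) -> List[str]:
    """Return one SKIP or INSERT decision per usable mention."""
    actions = []
    for ent in entities:
        # Same filter as the Note 3 cell: drop mentions that are just the type name.
        if ent.mention_text.lower() == ent.label.lower():
            continue
        if ent.is_linked:
            actions.append(f"SKIP {ent.mention_text!r} -> {ent.linked_entity.entity_id}")
        else:
            actions.append(f"INSERT {ent.mention_text!r} ({ent.label})")
    return actions


demo = [
    Mention("Jon Doe", "Patient", Link("P001", "John Doe")),
    Mention("Jane Smith", "Patient"),
    Mention("Patient", "Patient"),  # generic mention, filtered out
]
for line in plan_actions(demo):
    print(line)
```

Returning the decisions instead of printing them keeps the logic testable and lets every note share the generic-mention filter that currently appears only in the Note 3 cell.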


## Summary

This demo shows how GLinker can:

1. **Extract entities** from unstructured text using zero-shot learning
2. **Match variations** ("Jon Doe" vs "John Doe") through database aliases and L2/L3 matching
3. **Link to existing records** to avoid duplicate database entries
4. **Identify new entities** that need to be inserted

**Name variation handling** is achieved through:
- Database aliases (e.g., "Jon Doe", "John H Doe" as aliases for "John Doe")
- L2 fuzzy search (finds similar candidates)
- L3 disambiguation (confirms correct matches)
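
The notebook only prints the intended action and never writes to a database. To make the "insert only new records" step concrete, here is a minimal sketch using Python's built-in `sqlite3` with an in-memory database. The table layout, the `insert_if_new` helper, and the generated ID `P003` are all hypothetical.

```python
import sqlite3
from typing import Optional

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patients (id TEXT PRIMARY KEY, name TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO patients (id, name) VALUES (?, ?)",
    [("P001", "John Doe"), ("P002", "Sarah Connor")],
)


def insert_if_new(conn: sqlite3.Connection, mention: str,
                  linked_id: Optional[str], new_id: str) -> bool:
    """Insert only when the linker found no existing record. True if a row was added."""
    if linked_id is not None:
        return False  # already in the database, skip insertion
    conn.execute("INSERT INTO patients (id, name) VALUES (?, ?)", (new_id, mention))
    return True


print(insert_if_new(conn, "Jon Doe", "P001", "P003"))   # False: linked to P001
print(insert_if_new(conn, "Jane Smith", None, "P003"))  # True: no link, so inserted
print(conn.execute("SELECT COUNT(*) FROM patients").fetchone()[0])  # 3
```

In a real system the new ID would come from the database or an ID service, and the inserted record would also need to be loaded back into the L2 dictionary so later notes can link to it.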