# Entity Resolution Demo Notebook

This notebook showcases how Neo4j can help identify and resolve duplication caused by near-similarities in your database.

There are a few pre-requisites to take care of before we get started. 


## 0. Pre-requisites


### Python Packages
Be sure you have installed the following Python packages:

`pip3 install faker neo4j`

or 

`conda install faker conda-forge::neo4j-python-driver`

If the conda install is not working, please refer to [anaconda.org](https://anaconda.org/conda-forge/neo4j-python-driver).

In [1]:
import os
import re
import random
import time
import logging
from typing import Optional, Dict, List, Any
from faker import Faker
from neo4j import GraphDatabase, Driver

### Neo4j Database

Before starting this demo, create a new database and update URI and PASSWORD below (or set the NEO4J_URI and NEO4J_PASSWORD environment variables). You may also need to update USER or DB_NAME if those are different for your DB instance.

We recommend using [Neo4j Desktop](https://neo4j.com/product/#neo4j-desktop) as it allows you to use the Graph Data Science Library at no cost. If you are using Neo4j Desktop, you can find the port for your URI by clicking Details and copying the "Bolt port".


You will also need to ensure APOC and Graph Data Science are enabled on your instance. If you do not do this, you will receive errors later in the notebook.
- [Installing APOC](https://neo4j.com/docs/apoc/current/installation/)
- [Installing GDS](https://neo4j.com/docs/graph-data-science/current/installation/)
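
One way to catch a missing plugin early is to ask each library for its version before running the rest of the notebook. The sketch below assumes the `apoc.version()` and `gds.version()` functions that the two plugins expose. `check_plugins` is a hypothetical helper, not part of the notebook, and it accepts any session-like object with a `run` method so it can be tried without a live database.

```python
from typing import Any, Dict, Optional

# Version functions assumed to be provided by each plugin.
PLUGIN_QUERIES = {
    "APOC": "RETURN apoc.version() AS version",
    "GDS": "RETURN gds.version() AS version",
}

def check_plugins(session: Any) -> Dict[str, Optional[str]]:
    """Return the reported version per plugin, or None if the call fails."""
    versions: Dict[str, Optional[str]] = {}
    for name, query in PLUGIN_QUERIES.items():
        try:
            versions[name] = session.run(query).single()["version"]
        except Exception:  # e.g. an unknown-function error when the plugin is missing
            versions[name] = None
    return versions

# With the notebook's driver (needs a running database and the
# configuration and get_driver cells further down):
# with get_driver() as driver, driver.session(database=DB_NAME) as session:
#     print(check_plugins(session))
```

A `None` entry in the result suggests the corresponding plugin still needs to be installed or enabled.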


In [None]:
URI = os.getenv("NEO4J_URI", "bolt://localhost:7687")
USER = os.getenv("NEO4J_USER", "neo4j")
PASSWORD = os.getenv("NEO4J_PASSWORD", "password")
DB_NAME = os.getenv("NEO4J_DB", "neo4j")
BATCH_ID = "batch1"
TOTAL_NODES = 10000
DUPLICATE_PERCENT = 0.1

This cell configures logging and defines a helper function, get_driver, that returns a Neo4j driver instance. This setup ensures the environment is ready for the subsequent tasks.

In [3]:
# --- Logging Setup ---
logging.basicConfig(level=logging.INFO, format="%(levelname)s: %(message)s")
logger = logging.getLogger(__name__)

# --- Driver Helper ---
def get_driver() -> Driver:
    """Return a Neo4j driver instance."""
    return GraphDatabase.driver(URI, auth=(USER, PASSWORD))


## 1. Create mock data candidates

This section defines utility functions that support data generation.

- cleanup_previous_batch clears existing candidate data for the current batch.
- introduce_small_typo and introduce_phone_variation simulate common data imperfections by introducing slight variations.

These functions help create realistic, messy data for testing the entity resolution process.

In [4]:
def cleanup_previous_batch(driver: Driver) -> None:
    """Removes all Candidate nodes for the current batch."""
    with driver.session(database=DB_NAME) as session:
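        # Pass the batch id as a query parameter instead of interpolating it into the Cypher text.
        session.run(
            "MATCH (c:Candidate {batchId: $batchId}) DETACH DELETE c",
            parameters={"batchId": BATCH_ID},
        )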
    logger.info(f"Old Candidate nodes for batch '{BATCH_ID}' removed.")

def introduce_small_typo(original_str: str) -> str:
    """Introduce a small typo in a string."""
    if not original_str or len(original_str) < 5:
        return original_str
    if random.random() < 0.5:
        return original_str
    s_list = list(original_str)
    pos = random.randint(0, len(s_list) - 1)
    if random.random() < 0.5:
        del s_list[pos]
    else:
        s_list[pos] = chr(random.randint(ord('a'), ord('z')))
    return "".join(s_list)

def introduce_phone_variation(phone_str: str) -> str:
    """Slightly modifies a phone number."""
    if not phone_str:
        return phone_str
    if random.random() < 0.5:
        return phone_str  # Leave unchanged half the time
    only_digits = ''.join(filter(str.isdigit, phone_str))
    if len(only_digits) > 6:
        if random.random() < 0.5:
            only_digits = only_digits[:-1]
        else:
            only_digits = only_digits[:-1] + str(random.randint(0,9))
    formatted = only_digits
    if len(only_digits) > 3:
        formatted = f"({only_digits[:3]}) {only_digits[3:]}"
    return formatted
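
The helpers above draw from Python's `random` module, so their output changes on every run unless that module is seeded. If repeatable near-duplicates matter, seeding it is one option. The sketch restates `introduce_small_typo` so it runs on its own; `sample_typos` and the seed value are illustrative assumptions.

```python
import random
from typing import List

def introduce_small_typo(original_str: str) -> str:
    """Restated from the cell above so this snippet runs on its own."""
    if not original_str or len(original_str) < 5:
        return original_str
    if random.random() < 0.5:
        return original_str  # leave unchanged half the time
    s_list = list(original_str)
    pos = random.randint(0, len(s_list) - 1)
    if random.random() < 0.5:
        del s_list[pos]  # drop one character
    else:
        s_list[pos] = chr(random.randint(ord('a'), ord('z')))  # swap in a lowercase letter
    return "".join(s_list)

def sample_typos(value: str, count: int, seed: int) -> List[str]:
    """Seed the module-level RNG, then draw `count` variations of `value`."""
    random.seed(seed)
    return [introduce_small_typo(value) for _ in range(count)]

first = sample_typos("Theodore Chadwick", 5, seed=42)
second = sample_typos("Theodore Chadwick", 5, seed=42)
print(first == second)  # True: the same seed replays the same typos
print(first)
```

Each variation is at most one edit away from the original: one deleted character, one substituted character, or no change at all.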


Here, we generate a set of candidate nodes using the Faker library, simulating a large dataset with intentional near-duplicates. The data is inserted into Neo4j in batches. This section serves to populate the database with candidate nodes, providing a realistic dataset for testing and development.

In [5]:
def create_mock_data(cleanup: bool = True) -> None:
    """
    Generates candidate data, injects near-duplicates, and inserts into Neo4j.
    """
    fake = Faker()
    Faker.seed(42)
    driver = get_driver()
    logger.info("Connected to Neo4j for mock data generation.")
    
    if cleanup:
        cleanup_previous_batch(driver)
    
    candidates: List[Dict[str, Any]] = []
    for i in range(TOTAL_NODES):
        candidate = {
            "candidateId": f"cand_{i}",
            "batchId": BATCH_ID,
            "fullName": fake.name(),
            "email": fake.email(),
            "phoneNumber": fake.phone_number(),
            "address": fake.address().replace("\n", ", ")
        }
        candidates.append(candidate)
    
    # Inject near-duplicates.
    num_duplicates = int(TOTAL_NODES * DUPLICATE_PERCENT)
    for _ in range(num_duplicates):
        original = random.choice(candidates)
        duplicate = {
            "candidateId": f"dup_{original['candidateId']}_{random.randint(1,100000)}",
            "batchId": BATCH_ID,
            "fullName": introduce_small_typo(original["fullName"]),
            "email": introduce_small_typo(original["email"]),
            "phoneNumber": introduce_phone_variation(original["phoneNumber"]),
            "address": introduce_small_typo(original["address"])
        }
        candidates.append(duplicate)
    
    random.shuffle(candidates)
    batch_size = 1000
    with driver.session(database=DB_NAME) as session:
        for i in range(0, len(candidates), batch_size):
            batch = candidates[i:i+batch_size]
            cypher = """
            UNWIND $rows AS row
            CREATE (c:Candidate {
              candidateId: row.candidateId,
              batchId: row.batchId,
              fullName: row.fullName,
              email: row.email,
              phoneNumber: row.phoneNumber,
              address: row.address
            })
            """
            session.run(cypher, parameters={"rows": batch})
            logger.info(f"Inserted batch up to index {i}")
    
    driver.close()
    logger.info("Mock data generation complete.")

# Run this cell to generate the mock data:
create_mock_data(cleanup=True)


INFO: Connected to Neo4j for mock data generation.
INFO: Old Candidate nodes for batch 'batch1' removed.
INFO: Inserted batch up to index 0
INFO: Inserted batch up to index 1000
INFO: Inserted batch up to index 2000
INFO: Inserted batch up to index 3000
INFO: Inserted batch up to index 4000
INFO: Inserted batch up to index 5000
INFO: Inserted batch up to index 6000
INFO: Inserted batch up to index 7000
INFO: Inserted batch up to index 8000
INFO: Inserted batch up to index 9000
INFO: Inserted batch up to index 10000
INFO: Mock data generation complete.
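
The eleven "Inserted batch" lines above follow from the configuration: 10,000 base records plus 10% near-duplicates gives 11,000 rows, written 1,000 at a time. A minimal sketch of that arithmetic, using a hypothetical `batch_starts` helper that mirrors the `range(0, len(candidates), batch_size)` loop:

```python
from typing import List

TOTAL_NODES = 10000
DUPLICATE_PERCENT = 0.1
BATCH_SIZE = 1000  # same value as batch_size in create_mock_data

def batch_starts(total_rows: int, batch_size: int) -> List[int]:
    """Start index of each slice, as produced by range(0, total_rows, batch_size)."""
    return list(range(0, total_rows, batch_size))

num_duplicates = int(TOTAL_NODES * DUPLICATE_PERCENT)
total_rows = TOTAL_NODES + num_duplicates
starts = batch_starts(total_rows, BATCH_SIZE)

print(num_duplicates)  # 1000
print(total_rows)      # 11000
print(len(starts))     # 11
print(starts[-1])      # 10000
```

The logged value is the start index of each batch, so the last line (10000) covers rows 10,000 through 10,999.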


## 2. Create "fraud family" clusters

This section inserts two specific demo clusters into the database. These clusters (for example, a fraud family and multiple variations of a single individual) serve as controlled test cases. The clusters are inserted using APOC's periodic iteration, which is efficient for batch processing. This controlled data helps in validating the entity resolution logic.

Note: the APOC library must be installed in Neo4j. See the [APOC installation guide](https://neo4j.com/docs/apoc/current/installation/).


In [6]:
# Demo clusters for testing resolution
CLUSTER_1_DATA = [
    {
        "candidateId": "FRAUD_101",
        "batchId": BATCH_ID,
        "fullName": "Theodore Chadwick",
        "email": "theo.chadwick@gmail.com",
        "phoneNumber": "555-1234",
        "address": "123 Fraud Rd, Chicago"
    },
    # ... (other records for cluster 1)
]

CLUSTER_2_DATA = [
    {
        "candidateId": "PERSON_201",
        "batchId": BATCH_ID,
        "fullName": "Jessica Parsons",
        "email": "jessparsons@gmail.com",
        "phoneNumber": "423-502-1235",
        "address": "99 Demo Ln, Springfield"
    },
    # ... (other records for cluster 2)
]

def create_clusters() -> None:
    """Insert demo clusters into Neo4j."""
    driver = get_driver()
    logger.info("Connected to Neo4j for inserting demo clusters.")
    with driver.session(database=DB_NAME) as session:
        for cluster_data, batch_size in [(CLUSTER_1_DATA, 5), (CLUSTER_2_DATA, 6)]:
            session.run(
                """
                CALL apoc.periodic.iterate(
                  'UNWIND $rows AS row RETURN row',
                  'CREATE (c:Candidate {
                     candidateId: row.candidateId,
                     batchId: row.batchId,
                     fullName: row.fullName,
                     email: row.email,
                     phoneNumber: row.phoneNumber,
                     address: row.address
                   })',
                  {batchSize: $batchSize, parallel: false, params: {rows: $rows}}
                )
                """,
                parameters={"rows": cluster_data, "batchSize": batch_size},
            )
            logger.info("Inserted one demo cluster.")
    driver.close()
    logger.info("Demo clusters inserted successfully.")

# Run this cell to insert demo clusters:
create_clusters()

INFO: Connected to Neo4j for inserting demo clusters.
INFO: Inserted one demo cluster.
INFO: Inserted one demo cluster.
INFO: Demo clusters inserted successfully.


## 3. Normalize the data

Data normalization is critical for comparing candidate records reliably. In this section, functions are defined to normalize key properties such as full names, phone numbers, emails, and addresses. The normalize_properties pipeline updates the candidate nodes with standardized values (e.g., normalizedFullName). This consistency is necessary for accurate similarity comparisons later in the process.


### First, let's create some helper functions 

In [7]:
def normalize_phone(phone_str: Optional[str]) -> Optional[str]:
    """Remove non-digit characters from a phone number."""
    if not phone_str:
        return None
    digits = re.sub(r'[^0-9]', '', phone_str)
    return digits if digits else None

def normalize_email(email_str: Optional[str]) -> Optional[str]:
    """Lowercase and trim email."""
    if not email_str:
        return None
    return email_str.strip().lower()

def normalize_address(addr_str: Optional[str]) -> Optional[str]:
    """Simplify and lowercase an address."""
    if not addr_str:
        return None
    addr_str = addr_str.lower()
    addr_str = re.sub(r'[.,#]', '', addr_str)
    addr_str = addr_str.replace(" street", " st").replace(" avenue", " ave")
    return addr_str.strip()

def normalize_name(name_str: Optional[str]) -> Optional[str]:
    """Lowercase and trim a full name."""
    if not name_str:
        return None
    return name_str.strip().lower()

def normalize_properties() -> None:
    """
    Normalizes fullName, phone, email, and address for all candidates.
    Updates each Candidate node with normalized properties.
    """
    driver = get_driver()
    logger.info("Connected to Neo4j for property normalization.")
    with driver.session(database=DB_NAME) as session:
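        # The batch id is passed as a query parameter rather than interpolated into the Cypher text.
        fetch_query = """
        MATCH (c:Candidate {batchId: $batchId})
        RETURN c.candidateId AS candidateId, c.fullName AS fullName,
               c.phoneNumber AS phone, c.email AS email, c.address AS address
        """
        result = session.run(fetch_query, parameters={"batchId": BATCH_ID})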
        updates: List[Dict[str, Any]] = []
        for record in result:
            updates.append({
                "candidateId": record["candidateId"],
                "normalizedFullName": normalize_name(record["fullName"]),
                "normalizedPhone": normalize_phone(record["phone"]),
                "normalizedEmail": normalize_email(record["email"]),
                "normalizedAddress": normalize_address(record["address"])
            })
        batch_size = 1000
        for i in range(0, len(updates), batch_size):
            batch = updates[i:i+batch_size]
            update_cypher = """
            UNWIND $rows AS row
            MATCH (c:Candidate {candidateId: row.candidateId})
            SET c.normalizedFullName = row.normalizedFullName,
                c.normalizedPhone = row.normalizedPhone,
                c.normalizedEmail = row.normalizedEmail,
                c.normalizedAddress = row.normalizedAddress
            """
            session.run(update_cypher, parameters={"rows": batch})
            logger.info(f"Normalized batch up to index {i}")
    driver.close()
    logger.info("Property normalization complete.")

# Run this cell to normalize properties:
normalize_properties()

# (Verify in Neo4j Browser with:
# MATCH (c:Candidate) RETURN c.fullName, c.normalizedFullName LIMIT 10;)


INFO: Connected to Neo4j for property normalization.
INFO: Normalized batch up to index 0
INFO: Normalized batch up to index 1000
INFO: Normalized batch up to index 2000
INFO: Normalized batch up to index 3000
INFO: Normalized batch up to index 4000
INFO: Normalized batch up to index 5000
INFO: Normalized batch up to index 6000
INFO: Normalized batch up to index 7000
INFO: Normalized batch up to index 8000
INFO: Normalized batch up to index 9000
INFO: Normalized batch up to index 10000
INFO: Normalized batch up to index 11000
INFO: Property normalization complete.
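
To make the effect of the helpers concrete, the sketch below applies the same rules to a few hand-written values. The functions are restated in compact form so the snippet runs on its own, and the sample inputs are invented for illustration.

```python
import re
from typing import Optional

def normalize_phone(phone_str: Optional[str]) -> Optional[str]:
    """Keep digits only; return None when nothing is left."""
    if not phone_str:
        return None
    return re.sub(r'[^0-9]', '', phone_str) or None

def normalize_email(email_str: Optional[str]) -> Optional[str]:
    """Trim and lowercase."""
    return email_str.strip().lower() if email_str else None

def normalize_address(addr_str: Optional[str]) -> Optional[str]:
    """Lowercase, drop '.', ',' and '#', and abbreviate street/avenue."""
    if not addr_str:
        return None
    addr_str = re.sub(r'[.,#]', '', addr_str.lower())
    addr_str = addr_str.replace(" street", " st").replace(" avenue", " ave")
    return addr_str.strip()

print(normalize_phone("(423) 502-1235"))             # 4235021235
print(normalize_phone("n/a"))                        # None
print(normalize_email("  JessParsons@Gmail.com "))   # jessparsons@gmail.com
print(normalize_address("123 Main Street, Apt #4"))  # 123 main st apt 4
```

Differently formatted copies of the same phone number or email end up identical after this step, so later comparisons only see genuine differences.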


## 4. Calculate Similarity

This section sets up candidate matching by first creating indexes on normalized properties and optionally generating blocking keys to reduce comparison overhead. Functions are then defined to create SIMILAR relationships based on full name, email, phone number, and address similarities using measures like Jaro-Winkler and Levenshtein distances. Running these functions links candidate nodes that meet the similarity criteria.
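
The email and phone functions in this section turn an edit distance into a score with `1 - dist / maxLen` and keep pairs at or above the threshold. The following pure-Python sketch reproduces that formula outside the database. `levenshtein` and `length_normalized_similarity` are illustrative helpers, not APOC calls, and the sample strings are made up.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete from a
                current[j - 1] + 1,      # insert into a
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]

def length_normalized_similarity(a: str, b: str) -> float:
    """1 - dist / maxLen, the same expression used in the Cypher queries."""
    max_len = max(len(a), len(b))
    if max_len == 0:
        return 1.0  # two empty strings are treated as identical here
    return 1.0 - levenshtein(a, b) / max_len

email_sim = length_normalized_similarity("jessparsons@gmail.com", "jessparson@gmail.com")
short_phone_sim = length_normalized_similarity("5551234", "5551235")

print(round(email_sim, 3))        # 0.952 -> passes a 0.9 threshold
print(round(short_phone_sim, 3))  # 0.857 -> one changed digit in 7 fails it
```

Because the score is length-normalized, a single-character change costs more on short values: with a 0.9 threshold, strings shorter than 10 characters effectively need an exact match.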

In [8]:
def create_candidate_indexes(driver: Driver) -> None:
    """Creates indexes on candidate properties."""
    index_queries = [
        "CREATE INDEX candidate_candidateId_index IF NOT EXISTS FOR (c:Candidate) ON (c.candidateId)",
        "CREATE INDEX candidate_phone_index IF NOT EXISTS FOR (c:Candidate) ON (c.normalizedPhone)",
        "CREATE INDEX candidate_email_index IF NOT EXISTS FOR (c:Candidate) ON (c.normalizedEmail)",
        "CREATE INDEX candidate_address_index IF NOT EXISTS FOR (c:Candidate) ON (c.normalizedAddress)"
    ]
    with driver.session(database=DB_NAME) as session:
        for query in index_queries:
            session.run(query)
    logger.info("Candidate indexes created.")

def create_soundex_blocking(driver: Driver) -> None:
    """
    Creates BlockKey nodes using soundex of normalizedFullName.
    Disable if blocking causes issues.
    """
    with driver.session(database=DB_NAME) as session:
        session.run("MATCH (b:BlockKey) DETACH DELETE b")
        query = f"""
        CALL apoc.periodic.iterate(
          'MATCH (c:Candidate {{batchId:"{BATCH_ID}"}}) RETURN c',
          'WITH c, apoc.text.soundex(c.normalizedFullName) AS sdx
           MERGE (bk:BlockKey {{value: sdx}})
           MERGE (c)-[:HAS_BLOCK]->(bk)',
          {{batchSize:1000, parallel:false}}
        )
        """
        session.run(query)
    logger.info("Soundex blocking applied.")

def create_similarity_by_name(driver: Driver, jaro_threshold: float = 0.15) -> None:
    """Creates SIMILAR relationships based on full name similarity."""
    query = f"""
    CALL apoc.periodic.iterate(
      "MATCH (c:Candidate {{batchId:'{BATCH_ID}'}}) RETURN c",
      "MATCH (c2:Candidate {{batchId:'{BATCH_ID}'}}) 
       WHERE id(c) < id(c2)
       WITH c, c2, apoc.text.jaroWinklerDistance(c.normalizedFullName, c2.normalizedFullName) AS dist
       WHERE dist < {jaro_threshold}
       CREATE (c)-[:SIMILAR {{
         comparedProperty: 'fullName',
         similarity: (1.0 - dist)
       }}]->(c2)",
      {{batchSize:200, parallel:false}}
    )
    """
    with driver.session(database=DB_NAME) as session:
        session.run(query)
    logger.info("Created SIMILAR relationships by fullName.")

def create_similarity_by_email(driver: Driver, similarity_threshold: float = 0.9) -> None:
    """Creates SIMILAR relationships based on email similarity."""
    query = f"""
    CALL apoc.periodic.iterate(
      "MATCH (c:Candidate {{batchId:'{BATCH_ID}'}}) WHERE c.normalizedEmail IS NOT NULL RETURN c",
      "MATCH (c2:Candidate {{batchId:'{BATCH_ID}'}}) 
       WHERE c2.normalizedEmail IS NOT NULL AND id(c) < id(c2)
       WITH c, c2,
            apoc.text.levenshteinDistance(c.normalizedEmail, c2.normalizedEmail) AS dist,
            CASE WHEN size(c.normalizedEmail) >= size(c2.normalizedEmail) THEN size(c.normalizedEmail) ELSE size(c2.normalizedEmail) END AS maxLen
       WITH c, c2, 1.0 - (toFloat(dist)/toFloat(maxLen)) AS sim
       WHERE sim >= {similarity_threshold}
       CREATE (c)-[:SIMILAR {{
         comparedProperty: 'email',
         similarity: sim
       }}]->(c2)",
      {{batchSize:200, parallel:false}}
    )
    """
    with driver.session(database=DB_NAME) as session:
        session.run(query)
    logger.info("Created SIMILAR relationships by email.")

def create_similarity_by_phone(driver: Driver, similarity_threshold: float = 0.9) -> None:
    """Creates SIMILAR relationships based on phone similarity."""
    query = f"""
    CALL apoc.periodic.iterate(
      "MATCH (c:Candidate {{batchId:'{BATCH_ID}'}}) WHERE c.normalizedPhone IS NOT NULL RETURN c",
      "MATCH (c2:Candidate {{batchId:'{BATCH_ID}'}}) 
       WHERE c2.normalizedPhone IS NOT NULL AND id(c) < id(c2)
       WITH c, c2,
            apoc.text.levenshteinDistance(c.normalizedPhone, c2.normalizedPhone) AS dist,
            CASE WHEN size(c.normalizedPhone) >= size(c2.normalizedPhone) THEN size(c.normalizedPhone) ELSE size(c2.normalizedPhone) END AS maxLen
       WITH c, c2, 1.0 - (toFloat(dist)/toFloat(maxLen)) AS sim
       WHERE sim >= {similarity_threshold}
       CREATE (c)-[:SIMILAR {{
         comparedProperty: 'phoneNumber',
         similarity: sim
       }}]->(c2)",
      {{batchSize:200, parallel:false}}
    )
    """
    with driver.session(database=DB_NAME) as session:
        session.run(query)
    logger.info("Created SIMILAR relationships by phoneNumber.")

def create_similarity_by_address(driver: Driver, jaro_threshold: float = 0.2) -> None:
    """Creates SIMILAR relationships based on address similarity."""
    query = f"""
    CALL apoc.periodic.iterate(
      "MATCH (c:Candidate {{batchId:'{BATCH_ID}'}}) WHERE c.normalizedAddress IS NOT NULL RETURN c",
      "MATCH (c2:Candidate {{batchId:'{BATCH_ID}'}}) 
       WHERE c2.normalizedAddress IS NOT NULL AND id(c) < id(c2)
       WITH c, c2, apoc.text.jaroWinklerDistance(c.normalizedAddress, c2.normalizedAddress) AS dist
       WHERE dist < {jaro_threshold}
       CREATE (c)-[:SIMILAR {{
         comparedProperty: 'address',
         similarity: (1.0 - dist)
       }}]->(c2)",
      {{batchSize:200, parallel:false}}
    )
    """
    with driver.session(database=DB_NAME) as session:
        session.run(query)
    logger.info("Created SIMILAR relationships by address.")

# Run these cells (one at a time or in a block) to create indexes, apply blocking, and generate SIMILAR relationships:
driver_instance = get_driver()
create_candidate_indexes(driver_instance)
# Optionally: create_soundex_blocking(driver_instance)  # Uncomment if blocking is desired.
create_similarity_by_name(driver_instance, jaro_threshold=0.15)
create_similarity_by_email(driver_instance, similarity_threshold=0.9)
create_similarity_by_phone(driver_instance, similarity_threshold=0.9)
create_similarity_by_address(driver_instance, jaro_threshold=0.2)
driver_instance.close()


INFO: Received notification from DBMS server: {severity: INFORMATION} {code: Neo.ClientNotification.Schema.IndexOrConstraintAlreadyExists} {category: SCHEMA} {title: `CREATE RANGE INDEX candidate_candidateId_index IF NOT EXISTS FOR (e:Candidate) ON (e.candidateId)` has no effect.} {description: `RANGE INDEX candidate_candidateId_index FOR (e:Candidate) ON (e.candidateId)` already exists.} {position: None} for query: 'CREATE INDEX candidate_candidateId_index IF NOT EXISTS FOR (c:Candidate) ON (c.candidateId)'
INFO: Received notification from DBMS server: {severity: INFORMATION} {code: Neo.ClientNotification.Schema.IndexOrConstraintAlreadyExists} {category: SCHEMA} {title: `CREATE RANGE INDEX candidate_phone_index IF NOT EXISTS FOR (e:Candidate) ON (e.normalizedPhone)` has no effect.} {description: `RANGE INDEX candidate_phone_index FOR (e:Candidate) ON (e.normalizedPhone)` already exists.} {position: None} for query: 'CREATE INDEX candidate_phone_index IF NOT EXISTS FOR (c:Candidate) ON
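
As written, each similarity function compares every candidate with every other candidate in the batch (`id(c) < id(c2)`), and the optional `BlockKey` nodes are not used to narrow that search. The sketch below shows, in plain Python, how much blocking can shrink the number of comparisons. For brevity it uses a hypothetical blocking key (first letter of the last name) instead of Soundex, and the names are invented.

```python
from collections import defaultdict
from typing import Dict, List

def pair_count(n: int) -> int:
    """Number of unordered pairs among n items: n * (n - 1) / 2."""
    return n * (n - 1) // 2

def block_by_key(names: List[str]) -> Dict[str, List[str]]:
    """Group names by a simple stand-in key: first letter of the last token."""
    blocks: Dict[str, List[str]] = defaultdict(list)
    for name in names:
        blocks[name.split()[-1][0].lower()].append(name)
    return dict(blocks)

names = [
    "jessica parsons", "jesica parsons", "jess parson",
    "theodore chadwick", "theo chadwick", "maria lopez",
]
blocks = block_by_key(names)

all_pairs = pair_count(len(names))
blocked_pairs = sum(pair_count(len(group)) for group in blocks.values())

print(all_pairs)          # 15
print(blocked_pairs)      # 4
print(pair_count(11000))  # 60494500
```

At the notebook's scale of roughly 11,000 candidates, the unblocked approach evaluates about 60 million pairs per property, which is why restricting comparisons to candidates that share a block key is worth considering.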

## 5. Duplicate Resolution Functions

Two strategies for handling duplicates are introduced in this section.

- merge_high_confidence merges candidate nodes whose aggregated similarity score (the weightedSum property on an AGGREGATED_SIMILAR relationship) is at or above a specified threshold, effectively combining duplicates (a destructive approach).
- link_high_confidence creates a :SAME_AS relationship between high-confidence duplicate nodes without merging them (a non-destructive approach).

Both functions expect AGGREGATED_SIMILAR relationships to exist already. They help further refine the deduplication process based on the quality of the similarity scores.

In [9]:
def merge_high_confidence(driver: Driver, threshold: float = 2.5) -> None:
    """
    Merge candidate nodes if their aggregated similarity (weightedSum) is 
    greater than or equal to the threshold. This is destructive.
    """
    query = f"""
    CALL apoc.periodic.iterate(
      "MATCH (c1:Candidate)-[r:AGGREGATED_SIMILAR]->(c2:Candidate)
       WHERE r.weightedSum >= {threshold}
       RETURN c1, c2",
      "CALL apoc.refactor.mergeNodes([c1, c2], {{
         properties: 'combine',
         mergeRels: true
      }}) YIELD node RETURN node",
      {{batchSize: 100, parallel: false}}
    )
    """
    with driver.session(database=DB_NAME) as session:
        session.run(query)
    logger.info(f"Nodes merged where weightedSum >= {threshold}.")

def link_high_confidence(driver: Driver, threshold: float = 2.5) -> None:
    """
    Create :SAME_AS relationships between candidate nodes if their aggregated 
    similarity (weightedSum) is greater than or equal to the threshold.
    This is non-destructive.
    """
    query = f"""
    MATCH (c1:Candidate)-[r:AGGREGATED_SIMILAR]->(c2:Candidate)
    WHERE r.weightedSum >= {threshold}
    MERGE (c1)-[:SAME_AS {{confidence: r.weightedSum}}]->(c2)
    """
    with driver.session(database=DB_NAME) as session:
        session.run(query)
    logger.info(f"Linked nodes with :SAME_AS where weightedSum >= {threshold}.")


In [10]:
# Get a driver instance (or reuse your existing driver if still open)
driver_instance = get_driver()

# Choose one or both strategies:
# Merge nodes (destructive):
# merge_high_confidence(driver_instance, threshold=2.5)

# Link nodes (non-destructive):
link_high_confidence(driver_instance, threshold=2.5)

driver_instance.close()


INFO: Linked nodes with :SAME_AS where weightedSum >= 2.5.
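
Both functions read a `weightedSum` property from `AGGREGATED_SIMILAR` relationships, but the step that builds those relationships is not shown in this notebook. Purely as an illustration of how such a score could be formed, the sketch below sums per-property similarities using weights. The equal weights, the sample scores, and the `weighted_sum` helper are all assumptions rather than the notebook's actual aggregation.

```python
from typing import Dict

# Hypothetical weights: every compared property counts equally.
WEIGHTS: Dict[str, float] = {
    "fullName": 1.0,
    "email": 1.0,
    "phoneNumber": 1.0,
    "address": 1.0,
}
THRESHOLD = 2.5  # default used by merge_high_confidence / link_high_confidence

def weighted_sum(similarities: Dict[str, float]) -> float:
    """Sum similarity * weight over the properties that produced a SIMILAR edge."""
    return sum(WEIGHTS[prop] * score for prop, score in similarities.items())

strong_pair = {"fullName": 0.95, "email": 0.95, "phoneNumber": 0.9}
weak_pair = {"fullName": 0.9, "address": 0.85}

print(weighted_sum(strong_pair) >= THRESHOLD)  # True  (about 2.8)
print(weighted_sum(weak_pair) >= THRESHOLD)    # False (about 1.75)
```

Under these assumed weights, a pair needs strong agreement on at least three properties to reach 2.5, since each property contributes at most 1.0.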


## 6. Master Entity Resolution Functions

This section deduplicates the data by consolidating candidate nodes into master nodes. A MasterEntity node is created for each unique community (determined by the clustering process and stored on each candidate as entityId), and candidate nodes are linked to their corresponding master node. Additionally, canonical properties are computed by aggregating values from the candidate nodes. This step results in a cleaned, deduplicated view of the data.

In [11]:
def create_master_nodes_and_links() -> None:
    """
    Creates distinct MasterEntity nodes for each candidate community (entityId) and links candidates to them.
    """
    driver = get_driver()
    with driver.session(database=DB_NAME) as session:
        create_master_nodes_query = """
        CALL apoc.periodic.iterate(
          "MATCH (c:Candidate) WHERE c.entityId IS NOT NULL RETURN DISTINCT c.entityId AS communityId",
          "MERGE (m:MasterEntity {communityId: communityId})",
          {batchSize: 1000, parallel: false}
        )
        """
        session.run(create_master_nodes_query)
        logger.info("MasterEntity nodes created.")
        link_candidates_query = """
        CALL apoc.periodic.iterate(
          "MATCH (c:Candidate) WHERE c.entityId IS NOT NULL RETURN c",
          "MATCH (m:MasterEntity {communityId: c.entityId}) MERGE (c)-[:BELONGS_TO]->(m)",
          {batchSize: 1000, parallel: false}
        )
        """
        session.run(link_candidates_query)
        logger.info("Candidates linked to MasterEntity nodes.")
    driver.close()

def set_canonical() -> None:
    """
    Computes canonical property values for each MasterEntity from its related Candidate nodes.
    """
    driver = get_driver()
    with driver.session(database=DB_NAME) as session:
        query = """
        CALL apoc.periodic.iterate(
          'MATCH (m:MasterEntity) RETURN m',
          'MATCH (c:Candidate {entityId: m.communityId})
           WITH m, 
                collect(c.fullName) AS allNames, 
                collect(c.email) AS allEmails, 
                collect(c.phoneNumber) AS allPhones, 
                collect(c.address) AS allAddresses
           WITH m,
                apoc.coll.frequenciesAsMap(allNames) AS nameFreq,
                apoc.coll.frequenciesAsMap(allEmails) AS emailFreq,
                apoc.coll.frequenciesAsMap(allPhones) AS phoneFreq,
                apoc.coll.frequenciesAsMap(allAddresses) AS addrFreq
           WITH m,
                keys(nameFreq)[0] AS bestName,
                keys(emailFreq)[0] AS bestEmail,
                keys(phoneFreq)[0] AS bestPhone,
                keys(addrFreq)[0] AS bestAddress
           SET m.fullNameCanonical = bestName,
               m.emailCanonical = bestEmail,
               m.phoneNumberCanonical = bestPhone,
               m.addressCanonical = bestAddress',
          {batchSize:50, parallel:false}
        )
        """
        session.run(query)
        logger.info("Canonical properties set for MasterEntity nodes.")
    driver.close()

# Run these cells to create master entities and set canonical properties:
create_master_nodes_and_links()
set_canonical()


INFO: MasterEntity nodes created.
INFO: Candidates linked to MasterEntity nodes.
INFO: Canonical properties set for MasterEntity nodes.
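
`set_canonical` builds a frequency map per property and then takes `keys(...)[0]`, which returns the first key of the map. As far as we can tell, that key is not guaranteed to be the most frequent value. If the intent is a majority vote, the selection logic can be sketched in Python as follows. `most_frequent` and the sample values are illustrative, not part of the notebook.

```python
from collections import Counter
from typing import List, Optional

def most_frequent(values: List[Optional[str]]) -> Optional[str]:
    """Return the most common non-null value; ties go to the value seen first."""
    present = [value for value in values if value is not None]
    if not present:
        return None
    return Counter(present).most_common(1)[0][0]

names = ["Jessica Parsons", "Jesica Parsons", "Jessica Parsons", None]
print(most_frequent(names))         # Jessica Parsons
print(most_frequent([None, None]))  # None
```

The same idea could be applied to each of the four properties so that a typo-laden variant does not become the canonical value by accident.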
