# Creating a Knowledge Graph from the output of the NER & RE models

### Instructions on how to work with Neo4j alongside this code

Before you start, download Neo4j Desktop from https://neo4j.com/download/

Then, create a project called "Text_Mining" (or whatever name you want, since this name is not used within this code). 

Click "Add" -> "Local DBMS" and name it "Text_Mining_Neo4j" (this name is not used within this code either). Set the password to "bilalbroski1" (IMPORTANT: the code below connects with this password).

Then you have to click the blue "Start" button and wait for the server to start. After it is done loading, click the blue "Open" button. This will cause the Neo4j Browser to open. 

In Neo4j Browser, run the command "MATCH (n) RETURN n" to show the complete KG, and run "MATCH (n) DETACH DELETE n" to completely empty the DBMS.

If any of this does not work, follow the steps in the "Download Neo4j" section of https://youtu.be/8jNPelugC2s?si=QV898-ggdLIq9XPk&t=597

In [1]:
import re
import os, os.path

from neo4j import GraphDatabase
from unidecode import unidecode
from num2words import num2words
import pickle
# from copy import deepcopy

# imports for loading the NER model using flair
from flair.data import Sentence
from flair.models import SequenceTagger

In [2]:
# load the NER model
custom_ner_model = SequenceTagger.load(r'flair_models\best-model.pt')

2023-10-31 23:17:57,329 SequenceTagger predicts: Dictionary with 10 tags: O, PLAYER, BIRTHDATE, COUNTRY, NATIONALITY, POSITION, CLUB, REFERENCE, <START>, <STOP>


In [3]:
# Set parameters for loading texts

# set to True in case you want to select a single specific text
select_text = True

# index of the specific text that you want to select
text_i = 369  # index of text for Thomas Alun Lockyer = 192; Mark Maria Hubertus Flekken = 369

# if select_text is set to False, there will be nr_texts selected
nr_texts = 2 # max number of texts is 1031

In [4]:
# import the texts that have to be converted into a KG (=Knowledge Graph)

# Create empty list to which the text(s) can be added
texts = []

if select_text: # only select the text with index text_i
    # the with statement closes the file automatically, so no f.close() is needed
    with open(f'./data/footballer_{text_i}.txt', 'r') as f:
        contents = f.read()
        texts.append(contents)
    print(f"The text with index number {text_i} has successfully been imported")

else: # select the first nr_texts texts
    for i in range(nr_texts):
        # the with statement closes the file automatically, so no f.close() is needed
        with open(f'./data/footballer_{i}.txt', 'r') as f:
            contents = f.read()
            texts.append(contents)

    print(f"{len(texts)} texts have been successfully imported")

print(texts)

The text with index number 369 has successfully been imported
['Mark Maria Hubertus Flekken (born 13 June 1993) is a Dutch professional footballer who plays as a goalkeeper for Premier League club Brentford and the Netherlands national team.Early years.Flekken grew up in Bocholtz, Limburg, Netherlands on the German border. His parents René and Annie used to play football themselves, and his younger brother Roy also became a goalkeeper. Club career.']


In [5]:
def augmented_text(sentence):
    '''
    Uses the entity predictions from the NER model to augment the text with entity tags.
    '''
    outp = sentence.text
    used_text = []
    for l in sentence.labels:
        text_token = l.labeled_identifier
        text_token = text_token.split()
        text, token = text_token[1].split('/')
        text = text.replace('"', '')
        if text not in used_text:
            outp = outp.replace(text, f'[{token}]{text}[{token}]')
            outp = outp.replace(f'[{token}] [{token}]', ' ')
            used_text.append(text)
        else: 
            continue
    return outp

In [6]:
# Perform NER on all texts to create a list with the texts in augmented form
augmented_texts = [] # Create empty list to append the augmented texts to

for text in texts: 
    sentence = Sentence(str(text))
    custom_ner_model.predict(sentence)
    augmented_sentence = augmented_text(sentence)
    augmented_texts.append(augmented_sentence)

print(augmented_texts)

['[PLAYER]Mark Maria Hubertus Flekken[PLAYER] (born [BIRTHDATE]13 June 1993[BIRTHDATE]) is a [NATIONALITY]Dutch[NATIONALITY] professional footballer who plays as a [POSITION]goalkeeper[POSITION] for Premier League club [CLUB]Brentford[CLUB] and the [COUNTRY]Netherlands[COUNTRY] national team.Early years.[PLAYER]Flekken[PLAYER] grew up in Bocholtz, Limburg, [COUNTRY]Netherlands[COUNTRY] on the German border. [REFERENCE]His[REFERENCE] parents René and Annie used to play football themselves, and his younger brother [PLAYER]Roy[PLAYER] also became a [POSITION]goalkeeper[POSITION]. Club career.']
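
Because every entity in the augmented text is wrapped in an identical opening and closing tag, the non-nested case can be read back with a single regular expression. The following is a minimal sketch and not the approach used later in this notebook; the names `TAG_PATTERN` and `extract_tagged_entities` are assumptions made for illustration, and nested tags are not handled.

```python
import re

# Matches "[TAG]entity[TAG]"; the backreference \1 forces the closing tag
# to be identical to the opening tag. Nested tags are NOT handled.
TAG_PATTERN = re.compile(r"\[([A-Z_]+)\](.+?)\[\1\]")


def extract_tagged_entities(augmented: str) -> list[tuple[str, str]]:
    """Return (tag, entity) pairs in order of appearance."""
    return TAG_PATTERN.findall(augmented)


sample = (
    "[PLAYER]Mark Maria Hubertus Flekken[PLAYER] (born "
    "[BIRTHDATE]13 June 1993[BIRTHDATE]) is a "
    "[NATIONALITY]Dutch[NATIONALITY] professional footballer"
)

for tag, entity in extract_tagged_entities(sample):
    print(tag, "->", entity)
```

Running this on an unescaped augmented text gives a quick way to sanity-check the NER output before any query is built.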


## Preprocessing

In [7]:
def escapeRegExp(text):
    r"""
    Prefix every opening bracket with a backslash ("\[") to escape this special character in the regex.
    This is required to prevent unterminated character sets.
    If the unterminated character set error still occurs, try running the commented-out lines as well.
    Source: https://stackoverflow.com/questions/54135606/python-re-error-unterminated-character-set-at-position
    """
    # "\\[" is a backslash followed by "["; writing "\[" would be an invalid escape sequence
    edited_text = text.replace("[", "\\[")
    # edited_text = edited_text.replace("{", "\{")
    # edited_text = edited_text.replace("(", "\(")
    # edited_text = edited_text.replace(")", "\)")
    return edited_text
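
To illustrate the error that the docstring refers to, the sketch below compiles a fragment with an unmatched opening bracket and then escapes it with the standard library's `re.escape`. This is only a demonstration: `re.escape` also escapes `]` and other metacharacters, so it is not a drop-in replacement for `escapeRegExp`, whose exact output the later cells rely on. The variable names are assumptions made for this example.

```python
import re

fragment = "[PLAYER"  # an opening bracket without a matching closing bracket

try:
    re.compile(fragment)
    compiled_unescaped = True
except re.error:
    # Python reports this as an unterminated character set
    compiled_unescaped = False

# re.escape puts a backslash in front of every regex metacharacter
escaped = re.escape(fragment)
matches_literally = re.search(escaped, "text with [PLAYER tag") is not None

print(compiled_unescaped, escaped, matches_literally)
```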

In [8]:
final_texts = map(escapeRegExp, augmented_texts)

In [9]:
# final_texts2 = map(escapeRegExp, augmented_texts)
# for i in final_texts2:
#     print(i)
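
Note that `map` returns a one-shot iterator, which is presumably why the commented-out cell builds a separate `final_texts2` just for printing: iterating over `final_texts` itself would leave nothing for the next cell. A small sketch of this behaviour, using made-up placeholder strings:

```python
words = ["[A]x[A]", "[B]y[B]"]  # placeholder inputs, not real data
lazy = map(str.lower, words)

first_pass = list(lazy)   # consumes the iterator
second_pass = list(lazy)  # nothing is left to yield

# Materialising the result once makes it safe to iterate several times
reusable = list(map(str.lower, words))

print(first_pass, second_pass, reusable)
```

Wrapping the `map` call in `list(...)` is therefore one option if the escaped texts need to be inspected and then processed.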

In [10]:
def fix_var_name(var_name: str):
    """
    Make a string comply with the Cypher variable name convention that a
    variable name must always start with a letter.
    If the string starts with digits, the leading number is replaced by the word(s) for that number.
    For example, 2_october becomes two_october.
    Returns the (possibly modified) variable name, or False if the name consists of
    a single character, which this function cannot handle.
    """
    if len(var_name)==1: # Unhandleable case
        return False

    if len(var_name) > 0 and var_name[0].isdigit(): # this condition might cause the original error to occur again!!
        i_final_int = 0 # starts at zero to prevent an index out of range error from occurring at var_name[i_final_int].isdigit() in the while statement
        
        while ( var_name[i_final_int].isdigit() ) and ( i_final_int < len(var_name) -1 ):
            i_final_int += 1
        
        if i_final_int > 0:
            var_name = num2words( var_name[ : i_final_int] ) + var_name[i_final_int : ]
            
    return var_name
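
`fix_var_name` only repairs a leading number, so other characters that are unsafe in a Cypher variable can still slip through (for example a hyphen, which `num2words` produces for numbers such as 21, "twenty-one"). The sketch below is a deliberately conservative, ASCII-only check; Cypher itself may accept more than this pattern allows, and the names `_SAFE_VARIABLE` and `is_safe_cypher_variable` are assumptions made for illustration.

```python
import re

# Conservative rule: start with an ASCII letter, then letters, digits or underscores.
_SAFE_VARIABLE = re.compile(r"[A-Za-z][A-Za-z0-9_]*")


def is_safe_cypher_variable(name: str) -> bool:
    """Return True if the whole name matches the conservative pattern."""
    return _SAFE_VARIABLE.fullmatch(name) is not None


for candidate in ["two_october", "2_october", "greuther_furth", "twenty-one_june"]:
    print(candidate, is_safe_cypher_variable(candidate))
```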

## Automate creation of Cypher CREATE query

In [11]:
def create_entity_query(augmented_texts: list, suppress_warnings: bool=True):
    """
    Creates a Cypher query that creates the entity part of a Neo4j Database 
    using the entities from the augmented_texts list.
    
    Inputs:
    - augmented_texts: list containing the annotated texts that have to be converted into a KG
    - suppress_warnings: If True, the warnings about texts that the function skips over will be suppressed.

    Assumptions: 
    - If entity tags are nested, this is never with a depth higher than 1. This means that the
        nested entities should always be of the shape "[TAG_1]entity1_pt1 [TAG_2]entity2[TAG_2] entity1_pt2[TAG_1]"
    
    Output: 
    - Each query-line has the following format: (variable_name:ENTITY_TAG{name: 'name'})
    - In this format, each variable_name is completely decapitalized and all 
        whitespaces have been replaced by underscores.
    - the name that is within the curly brackets of each entity has each individual word capitalized, 
        but that is the only preprocessing performed on it.
    """
    # initialise a set to which all query-strings are added, and that can later be concatenated into a single string.
    # Using a set helps prevent duplicates
    query_set = set()
    # initialise variable that prevents a closing tag from being investigated by skipping the next iteration of the current for loop if it equals True
    skip_iteration = False 
    # initialise variable that skips over an entire nested tag structure after it has been analyzed
    skip_nested_structure = 0 

    for text in augmented_texts:

        # Find all of the opening brackets in the text
        brackets_open = [m.start() for m in re.finditer("\\[", text)]
        # Find all of the closing brackets in the text
        brackets_close = [m.start() for m in re.finditer("]", text)]

        if (len(brackets_open) % 2 > 0) or (len(brackets_close) % 2 > 0):
            if not suppress_warnings:
                print(f"An odd number of brackets has been found! Namely length of BO = {len(brackets_open)} or length of BC = {len(brackets_close)}")
            continue
            
        if len(brackets_open) != len(brackets_close):
            if not suppress_warnings:
                print(f"There is an unequal number of opening and closing brackets! There are {len(brackets_open)} opening brackets, and {len(brackets_close)} closing brackets.")
            continue

        # Loop over each tag to create query lines for the entities they annotate, while accounting for nested tags.
        for i in range( len(brackets_open) -1 ): # -1 is to prevent the function from finding a next pair of brackets when arriving at the final pair of brackets

            # skip this iteration if the previous non-nested (opening) tag has already been investigated
            if skip_iteration:
                skip_iteration = False
                continue
            
            # skip this iteration if the current tag is part of an already investigated nested structure
            if skip_nested_structure:
                skip_nested_structure -= 1
                continue

            io = brackets_open[i] # index of opening bracket of the entity tag that is currently being looped over
            ic = brackets_close[i] # index of closing bracket of the entity tag that is currently being looped over
            opening_tag = text[io+1:ic] # find the opening tag

            # initialise the variables that are required when working with a nested tag
            io_next = -1 # index of the opening bracket of the tag that is under investigation in the while statement
            ic_next = -1 # index of the closing bracket of the tag that is under investigation in the while statement
            ie_tags = [] # initialise a list where the indices of the brackets of the inner entities' tags can be stored
                         # ie_tags[0][0] = opening bracket index of ie opening tag; 
                         # ie_tags[1][1] = closing bracket index of ie closing tag
            closing_tag = None # initialise the closing tag that we are trying to find
            next_tag = None # initialise the variable that finds the closing tag 
            # initialise variable that keeps track if the tag that is currently under investigation is a nested one
            is_nested = -1 # current tag is nested if is_nested > 0
            j = i+1 # initialise index to use in the while statement 
            
            while ( next_tag != opening_tag ) and ( j < len(brackets_open) - 1 ) :
                closing_tag = next_tag # Set the closing tag to the tag that was investigated in the previous loop of the while statement
                io_next = brackets_open[j] 
                ic_next = brackets_close[j]
                next_tag = text[io_next+1:ic_next]

                # Remember the indices of the brackets of the inner entities' tags
                ie_tags.append( [io_next, ic_next] ) 

                is_nested += 1
                j += 1

            if is_nested > 2 :
                if not suppress_warnings:
                    print(f"A tag has been encountered that is nested on more than a single level! It is located in between indices {io_next} and {ic_next}.")
                continue

            if is_nested > 0: # if the current tag is a nested tag
                # Explanation of used terminology: the inner entity (ie) is the entity that is nested within the outer entity (oe)

                # find the indices of the inner entity (ie)
                begin_ie = ie_tags[0][1] + 1 # index of the first letter of the inner entity
                end_ie = ie_tags[1][0] - 1 # index of the last letter of the inner entity
                
                # extract the inner entity (ie)
                inner_entity = text[ begin_ie: end_ie ].strip() 
                ie_variable = inner_entity.replace(" ", "_").lower() 
                ie_variable = fix_var_name(ie_variable)
                inner_entity = inner_entity.lower().title()
                # inner_entity = fix_var_name(inner_entity)
                # if the unhandleable case happens where the inner entity is empty, continue
                if not inner_entity:
                    continue

                # find the indices of the outer entity (oe)
                begin_oe = io + (len(opening_tag) + 2) # index of the first letter of the outer entity
                end_oe = brackets_open[i+3] - 2 # index of the last letter of the outer entity
                ie_ot = brackets_open[i+1] - 1 # index of the ie opening tag (=ie_ot)
                ie_ct = brackets_close[i+2] + 1 # index of the ie closing tag (=ie_ct)
                # extract the outer entity (oe)
                outer_entity = text[ begin_oe : ie_ot ] + inner_entity + text[ ie_ct : end_oe + 1 ].strip() 
                oe_variable = outer_entity.replace(" ", "_").lower()
                oe_variable = fix_var_name(oe_variable)
                outer_entity = outer_entity.lower().title()
                # outer_entity = fix_var_name(outer_entity)

                # if the unhandleable case happens where an entity has length 1, continue
                if (not ie_variable) or (not oe_variable): # or (not outer_entity):
                    continue

                # Create a Cypher-queriable line that can be used to add this entity to the KG
                query_line_oe = f"({unidecode(oe_variable)}:{opening_tag}{{name: '{unidecode(outer_entity)}'}})" 
                query_set.add(query_line_oe) # add the query line to the set of all query lines that are going to be added to the KG
                query_line_ie = f"({unidecode(ie_variable)}:{closing_tag}{{name: '{unidecode(inner_entity)}'}})" 
                query_set.add(query_line_ie) # add the query line to the set of all query lines that are going to be added to the KG

                # Make sure the current for loop skips over the remaining 3 tags from the current nested structure
                skip_nested_structure = 3 

            else: # if the current tag is not a nested one

                # extracting the name of the entity 
                begin_entity = io + (len(opening_tag) + 2) # index of the first letter of the entity
                end_entity = brackets_close[i+1] - (len(opening_tag) + 2) # index of the last letter of the entity
                entity = text[ begin_entity : end_entity  ].strip()
                entity_variable = entity.replace(" ", "_").lower()
                entity_variable = fix_var_name(entity_variable)
                # if the unhandleable case happens where an entity has length 1, continue
                if not entity_variable:
                    continue
                entity = entity.lower().title()
                # entity = fix_var_name(entity)
                # # if the unhandable case happens where an entity has length 1, continue
                # if not entity:
                #     continue

                # Create a Cypher-queriable line that can be used to add this entity to the KG
                query_line = f"({unidecode(entity_variable)}:{opening_tag}{{name: '{unidecode(entity)}'}})" 
                # print('query_line:', query_line)
                query_set.add(query_line) # add the query line to the set of all query lines that are going to be added to the KG

                # skip the next iteration of the current for loop to prevent a closing tag from being investigated
                skip_iteration = True

    return query_set

In [12]:
# Run create_entity_query on the selected (escaped) texts
query_set_NER = create_entity_query(final_texts, suppress_warnings=True)

In [13]:
# Show results
print(len(query_set_NER))
print(query_set_NER)

9
{"(brentford:CLUB{name: 'Brentford'})", "(dutch:NATIONALITY{name: 'Dutch'})", "(flekken:PLAYER{name: 'Flekken'})", "(netherlands:COUNTRY{name: 'Netherlands'})", "(goalkeeper:POSITION{name: 'Goalkeeper'})", "(roy:PLAYER{name: 'Roy'})", "(mark_maria_hubertus_flekken:PLAYER{name: 'Mark Maria Hubertus Flekken'})", "(thirteen_june_1993:BIRTHDATE{name: '13 June 1993'})", "(his:REFERENCE{name: 'His'})"}
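
The query-line format described in the docstring can be illustrated for a single non-nested entity. This is a simplified sketch, not the function above: it skips `unidecode` and `fix_var_name`, and the helper name `entity_query_line` is an assumption made for illustration.

```python
def entity_query_line(tag: str, entity: str) -> str:
    """Build one Cypher node pattern: (variable:TAG{name: 'Name'})."""
    cleaned = entity.strip()
    # variable name: lowercase, whitespace replaced by underscores
    variable = cleaned.replace(" ", "_").lower()
    # display name: every word capitalized
    name = cleaned.lower().title()
    return f"({variable}:{tag}{{name: '{name}'}})"


print(entity_query_line("CLUB", "Brentford"))
print(entity_query_line("PLAYER", "Mark Maria Hubertus Flekken"))
```

The doubled curly braces in the f-string produce the literal `{` and `}` that Cypher needs around the property map.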


## Create the Knowledge Graph in Neo4j

### Implement the RE part of the query

In [14]:
# import the output from the RE model
RE_data_path = "C:/Users/guusj/Documents/AAA_Master_DSAI/Y2Q1/2AMM30_Text_Mining/DATA/test_RE_output_flekken"
# RE_data_path = "./test_RE_output_flekken"
with open(RE_data_path, "rb") as fp:
    re_output = pickle.load(fp)

# preprocess the variable names of the entities within data. 
# The variable names of the relationships already exist in the correct form
for i in range(len(re_output)):
    
    old_relation = re_output[i]

    # perform preprocessing
    new_relation_0 = old_relation[0].strip().replace(" ", "_").lower()
    new_relation_2 = old_relation[2].strip().replace(" ", "_").lower()

    # Make sure the imported variable names comply with the naming restrictions of Cypher
    new_relation_0 = fix_var_name(new_relation_0)
    new_relation_2 = fix_var_name(new_relation_2)

    # Combine all new entries into a tuple and use it to replace the old tuple
    new_relation = (new_relation_0, old_relation[1], new_relation_2)
    re_output[i] = new_relation

# convert the re_output list into a set
re_output = set(re_output)
# show results
# re_output

In [15]:
def create_relation_query(relations_set: set):
    """
    Creates a Cypher query that creates the relation/RE part of a Neo4j Database 
    
    Assumptions: 
    - relations_set is of shape: set((SUBJECT, relationship, OBJECT), ... ), where SUBJECT and OBJECT are variables 
        referring to entities and are contained within a tuple.
    - THE SUBJECTS, RELATIONSHIPS, AND OBJECTS HAVE THE SAME SHAPE/FORMAT AS IN THE NER MODEL, so
        - all have been unidecoded
        - subjects and objects have been stripped of leading and trailing whitespaces, are 
            completely in lowercase, and all leftover whitespaces have been replaced by underscores
    """
    # initialise a set to which all query-strings are added, and that can later be concatenated into a single string.
    # Using a set helps prevent duplicates
    query_set = set()

    for s in relations_set:
        
        subj = s[0].replace("'", "")
        # print(s)
        relation = s[1]
        obj = s[2].replace("'", "")

        query_line = f"({ unidecode(subj) })-[:{ relation }]->({ unidecode(obj) })" 
        query_set.add(query_line) # add the query line to the set of all query lines that are going to be added to the KG

    return query_set

In [16]:
query_set_RE = create_relation_query(re_output)
# Show results
query_set_RE

{'(mark_maria_hubertus_flekken)-[:born]->(thirteen_june_1993)',
 '(mark_maria_hubertus_flekken)-[:has_nationality]->(dutch)',
 '(mark_maria_hubertus_flekken)-[:originates_from]->(netherlands)',
 '(mark_maria_hubertus_flekken)-[:played_for]->(alemannia_aachen)',
 '(mark_maria_hubertus_flekken)-[:played_for]->(flekken)',
 '(mark_maria_hubertus_flekken)-[:played_for]->(greuther_furth)',
 '(mark_maria_hubertus_flekken)-[:plays_as]->(goalkeeper)',
 '(mark_maria_hubertus_flekken)-[:plays_for]->(brentford)'}
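
In the output above, `alemannia_aachen` and `greuther_furth` appear in relation lines but not among the entity lines produced for this text. In a Cypher `CREATE`, a variable that has not been bound earlier in the query creates a new node, which here would have no label and no `name` property. The sketch below shows one way to detect such variables before running the query; the helper name `undeclared_variables` is an assumption made for illustration, and it relies on the query-line formats shown in this notebook.

```python
import re


def undeclared_variables(entity_lines, relation_lines) -> set:
    """Return variables used in relation lines that no entity line declares."""
    declared = set()
    for line in entity_lines:
        match = re.match(r"\((\w+):", line)  # "(variable:TAG{...})"
        if match:
            declared.add(match.group(1))

    referenced = set()
    for line in relation_lines:
        # "(subject)-[:relation]->(object)" yields both variable names
        referenced.update(re.findall(r"\((\w+)\)", line))

    return referenced - declared


entities = {
    "(brentford:CLUB{name: 'Brentford'})",
    "(mark_maria_hubertus_flekken:PLAYER{name: 'Mark Maria Hubertus Flekken'})",
}
relations = {
    "(mark_maria_hubertus_flekken)-[:plays_for]->(brentford)",
    "(mark_maria_hubertus_flekken)-[:played_for]->(alemannia_aachen)",
}
print(undeclared_variables(entities, relations))
```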

## Create the Final Query

In [17]:
def combine_queries(query_set_NER: set, query_set_RE: set):
    """
    Concatenate all of the query lines containing the named entities together with the query lines
    containing the relationships, in order to form one final Cypher query that can be run to 
    create the KG.
    """
    # Concatenate all of the query lines to form one final Cypher query that creates the KG
    cqlCreate = """CREATE"""

    # add the entities from the NER set
    for i, line1 in enumerate(query_set_NER):
        if not i: # the first query entry should not be separated from the CREATE statement with a comma
            cqlCreate = cqlCreate + ' \n' + line1
        else:
            cqlCreate = cqlCreate + ',\n' + line1

    # add the relationships from the RE set
    for j, line2 in enumerate(query_set_RE):
        if j == (len(query_set_RE)-1): # The final entry of the query should end with a semicolon
            cqlCreate = cqlCreate + ',\n' + line2 + ';'
        else:
            cqlCreate = cqlCreate + ',\n' + line2

    return cqlCreate
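
The same concatenation can be sketched with `str.join`, which avoids the special cases for the first and last entry and also terminates the query correctly when there are no relations. This is an alternative sketch under the assumption that the line order does not matter; the name `combine_with_join` is hypothetical, and the lines are sorted only to make the result reproducible.

```python
def combine_with_join(entity_lines, relation_lines) -> str:
    """Join entity and relation lines into a single Cypher CREATE query."""
    lines = sorted(entity_lines) + sorted(relation_lines)
    if not lines:
        return ""  # nothing to create
    return "CREATE \n" + ",\n".join(lines) + ";"


demo_query = combine_with_join(
    {"(brentford:CLUB{name: 'Brentford'})", "(roy:PLAYER{name: 'Roy'})"},
    {"(roy)-[:plays_for]->(brentford)"},
)
print(demo_query)
```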

In [18]:
cqlCreate = combine_queries(query_set_NER, query_set_RE)
# Show results
cqlCreate

"CREATE \n(brentford:CLUB{name: 'Brentford'}),\n(dutch:NATIONALITY{name: 'Dutch'}),\n(flekken:PLAYER{name: 'Flekken'}),\n(netherlands:COUNTRY{name: 'Netherlands'}),\n(goalkeeper:POSITION{name: 'Goalkeeper'}),\n(roy:PLAYER{name: 'Roy'}),\n(mark_maria_hubertus_flekken:PLAYER{name: 'Mark Maria Hubertus Flekken'}),\n(thirteen_june_1993:BIRTHDATE{name: '13 June 1993'}),\n(his:REFERENCE{name: 'His'}),\n(mark_maria_hubertus_flekken)-[:played_for]->(alemannia_aachen),\n(mark_maria_hubertus_flekken)-[:played_for]->(flekken),\n(mark_maria_hubertus_flekken)-[:originates_from]->(netherlands),\n(mark_maria_hubertus_flekken)-[:has_nationality]->(dutch),\n(mark_maria_hubertus_flekken)-[:plays_for]->(brentford),\n(mark_maria_hubertus_flekken)-[:played_for]->(greuther_furth),\n(mark_maria_hubertus_flekken)-[:born]->(thirteen_june_1993),\n(mark_maria_hubertus_flekken)-[:plays_as]->(goalkeeper);"

## Create the Knowledge Graph in Neo4j using Cypher

In [19]:
# Database Credentials

uri = "bolt://localhost:7687" # Click the copy button in the "Bolt port" row from the table that appears when you click the Text_Mining_Neo4j DBMS in Neo4j Desktop
userName = "neo4j"
password = "bilalbroski1" # password for Text_Mining_Neo4j DBMS

In [20]:
# Connect to the neo4j database server
graphDB_Driver = GraphDatabase.driver(uri, auth=(userName, password))

In [21]:
# variable that should be set to "True" when the CREATE query has to be run
# The reason why this variable exists, is that if you run the code multiple times 
# with create_DB set to True, there will be a lot of duplicates within Neo4j
create_DB = True
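
One way to make repeated runs less of a problem is Cypher's `MERGE` clause, which matches an existing pattern and only creates it when it is missing. The sketch below builds a parameterised `MERGE` statement for a single node. It is not what this notebook does, and the helper name `merge_node_query` is an assumption; the resulting pair could be passed to `session.run(query, parameters)`, while relationships would need their own statements.

```python
def merge_node_query(tag: str, name: str):
    """Return a (query, parameters) pair that creates the node only if absent."""
    # Labels cannot be passed as query parameters, so the tag is validated
    # before it is interpolated into the query text.
    if not tag.isidentifier():
        raise ValueError(f"Unsafe label: {tag!r}")
    query = f"MERGE (n:{tag} {{name: $name}})"
    return query, {"name": name}


query, parameters = merge_node_query("CLUB", "Brentford")
print(query, parameters)
```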

In [22]:
# Create a few queries to test the Knowledge Graph with after it has been created

# CQL (=Cypher Query Language) to query all players that played for the Dutch national team
cqlNationalTeamQuery = """MATCH (player:PLAYER) -[:played_for] -> (country:COUNTRY) 
WHERE country.name = "Netherlands"
RETURN player.name
"""
# CQL (=Cypher Query Language) to query all players that play as goalkeeper
cqlGoalkeeperQuery = """MATCH (player:PLAYER) -[:plays_as] -> (position:POSITION) 
WHERE position.name = "Goalkeeper"
RETURN player.name
"""

In [23]:
# Execute the CQL query to create the KG
if create_DB:
    with graphDB_Driver.session() as graphDB_Session:
        # Create nodes
        graphDB_Session.run(cqlCreate)

# Show the query that we ran again
cqlCreate

"CREATE \n(brentford:CLUB{name: 'Brentford'}),\n(dutch:NATIONALITY{name: 'Dutch'}),\n(flekken:PLAYER{name: 'Flekken'}),\n(netherlands:COUNTRY{name: 'Netherlands'}),\n(goalkeeper:POSITION{name: 'Goalkeeper'}),\n(roy:PLAYER{name: 'Roy'}),\n(mark_maria_hubertus_flekken:PLAYER{name: 'Mark Maria Hubertus Flekken'}),\n(thirteen_june_1993:BIRTHDATE{name: '13 June 1993'}),\n(his:REFERENCE{name: 'His'}),\n(mark_maria_hubertus_flekken)-[:played_for]->(alemannia_aachen),\n(mark_maria_hubertus_flekken)-[:played_for]->(flekken),\n(mark_maria_hubertus_flekken)-[:originates_from]->(netherlands),\n(mark_maria_hubertus_flekken)-[:has_nationality]->(dutch),\n(mark_maria_hubertus_flekken)-[:plays_for]->(brentford),\n(mark_maria_hubertus_flekken)-[:played_for]->(greuther_furth),\n(mark_maria_hubertus_flekken)-[:born]->(thirteen_june_1993),\n(mark_maria_hubertus_flekken)-[:plays_as]->(goalkeeper);"

In [24]:
# Execute all other CQL queries and print the results
with graphDB_Driver.session() as graphDB_Session:
    # Query the graph #1
    dutch_players = graphDB_Session.run(cqlNationalTeamQuery)

    print("Names of all football players that have played for the Dutch national team:")
    for player in dutch_players:
        print(player)

    # Query the graph #2
    goalkeepers = graphDB_Session.run(cqlGoalkeeperQuery)

    print('\n')
    print("Names of players that play as goalkeepers:")
    for player in goalkeepers:
        print(player)

Names of all football players that have played for the Dutch national team:


Names of players that play as goalkeepers:
<Record player.name='Mark Maria Hubertus Flekken'>


# Appendix with helpful explanations / code:

### List of helpful Cypher commands:
- To show the complete KG: 
    - MATCH (n) RETURN n
- To delete the complete KG:
    - MATCH (n) DETACH DELETE n

For an example of what a Cypher query to create a KG in Neo4j should look like: https://github.com/harblaith7/Neo4j-Crash-Course/blob/main/01-initial-data.cypher

### Find index:

If you want to know which index a certain text has within the dataset, assign the name that starts the text as a string to the begin_str variable.


The following code will then print the index of the text that begins like this when you run it:

In [25]:
# Define the number of players/texts
nr_texts = 1031 # set to max number of texts (1031) since you want to search all of these

# begin_str = "Thomas Alun Lockyer"
begin_str = "Mark Maria Hubertus Flekken"

for i in range(nr_texts):
    with open(f'./data/footballer_{i}.txt', 'r') as f:
        contents = f.read()
        # texts.append(contents)
        if contents.startswith(begin_str):
            print(f"The index of the text that starts with '{begin_str}' is: ", i)

The index of the text that starts with 'Mark Maria Hubertus Flekken' is:  369
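
The same lookup can be wrapped in a reusable function. The sketch below assumes the `footballer_{i}.txt` naming scheme used in this notebook and UTF-8 encoded files (an assumption); the function name `find_text_indices` is hypothetical, and the demo runs on throwaway placeholder files rather than the real dataset.

```python
import tempfile
from pathlib import Path


def find_text_indices(data_dir, begin_str: str, nr_texts: int) -> list:
    """Return the indices of all texts that start with begin_str."""
    matches = []
    for i in range(nr_texts):
        path = Path(data_dir) / f"footballer_{i}.txt"
        if not path.exists():
            continue  # skip gaps in the numbering instead of raising
        if path.read_text(encoding="utf-8").startswith(begin_str):
            matches.append(i)
    return matches


# Demo with placeholder files in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "footballer_0.txt").write_text("Thomas Alun Lockyer placeholder text", encoding="utf-8")
    Path(tmp, "footballer_1.txt").write_text("Mark Maria Hubertus Flekken placeholder text", encoding="utf-8")
    demo = find_text_indices(tmp, "Mark Maria Hubertus Flekken", nr_texts=2)

print(demo)
```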
