# Creating a Knowledge Graph from the output of the NER & RE models
This time, using a SpaCy implementation.

## For Component 2

### Instructions on how to work with Neo4j alongside this code

Before you start, download neo4j desktop from https://neo4j.com/download/

Then, create a project called "Text_Mining" (or whatever name you want, since this name is not used within this code). 

Click "Add" -> "Local DMBS" and name it "Text_Mining_Neo4j" (even though this name is also not used within this code) while setting the password to "bilalbroski1" (IMPORTANT!)

Then you have to click the blue "Start" button and wait for the server to start. After it is done loading, click the blue "Open" button. This will cause the Neo4j Browser to open. 

In Neo4j browser, run the command "MATCH (n) RETURN n" to show the complete KG, and run "MATCH (n) DETACH DELETE n" to completely empty the DMBS.

If any of this does not work, follow the steps in the "Download Neo4j" section of https://youtu.be/8jNPelugC2s?si=QV898-ggdLIq9XPk&t=597

In [1]:
import re
import os, os.path

from neo4j import GraphDatabase
from unidecode import unidecode
from num2words import num2words
import pickle
# from copy import deepcopy
import json

# # imports for loading the NER model using flair
# from flair.data import Sentence
# from flair.models import SequenceTagger

# import for loading the NER model using SpaCy
import spacy

In [2]:
# # # load the NER model using FLAIR
# # custom_ner_model = SequenceTagger.load(r'flair_models\best-model.pt')

# # Load the NER model using SpaCy
# custom_ner_model = spacy.load("model-best")

In [3]:
# modified verison of what Bilal wrote on 26-01-2024
def augmented_text(data):
    augmented_data = []
    for example in data:
        old_text = example['text']
        new_text = ''
        cur_index = 0
        for entity in example['entities']:
            start = entity[0]
            end = entity[1]
            label = entity[2]

            if cur_index < start:
                new_text += old_text[cur_index:start]
                new_text += f'[{label}]{old_text[start:end]}[{label}]'
                cur_index = end
            else:
                new_text += f'[{label}]{old_text[start:end]}[{label}]'
                cur_index = end
            
        augmented_data.append(new_text)
    
    return augmented_data

In [4]:
# Create empty list to which the text(s) can be added
texts = []

# only execute if manually_select_texts is set to True
manually_select_texts = False

if manually_select_texts:

    # Set parameters for loading texts

    # set to True in case you want to select a single specific text
    select_text = True

    # index of the specific text that you want to select
    text_i = 369  # index of text for Thomas Alun Lockyer = 192; Mark Maria Hubertus Flekken = 369

    # if select_text is set to False, there will be nr_texts selected
    nr_texts = 2 # max number of texts is 1031

    # import the texts that have to be converted into a KG (=Knowledge Graph)

    texts = []

    if select_text: # only select the text with index text_i
        with open(f'./data/footballer_{text_i}.txt', 'r') as f:
            contents = f.read()
            texts.append(contents)
        f.close()
        print(f"The text with index number {text_i} has successfully been imported")

    else: # select the first nr_texts texts
        for i in range(nr_texts):
            with open(f'./data/footballer_{i}.txt', 'r') as f:
                contents = f.read()
                texts.append(contents)
            f.close()

        print(f"{len(texts)} texts have been successfully imported")

    print(texts)

else:
    text_ids = []

    with open(r'predictions_C2.json','r') as data_file:    
        data = json.load(data_file)
        # for v in data:#[0].values():
        #     if v['id'] not in text_ids:
        #         # aug_text = augmented_text(v['text'])
        #         texts.append(v['text'])
        #         text_ids.append(v['id'])
        texts = augmented_text(data)

In [5]:
texts

["On [DATE]2/18/87,[DATE] at [TIME]0001 hours[TIME], during normal steady state operation, ([LOCATION]Mode 1[LOCATION], at 100 percent power) and no rod motion in progress, [EMPLOYEE]the Control Room operators[EMPLOYEE] observed a decrease in reactor power and that No. 2 control rod's position lights indicated that it had dropped. [EMPLOYEE]The operators[EMPLOYEE] immediately began to reduce generator load to restore the main coolant system Tave per plant procedure. At 0003 hours the reactor protection system initiated an automatic scram as the result of a [UNEXPECTED EVENT]high main coolant pressure condition[UNEXPECTED EVENT]. The [UNEXPECTED EVENT]high main coolant pressure[UNEXPECTED EVENT] occurred because the load reduction by the [EMPLOYEE]operator[EMPLOYEE] overcompensated for the power reduction from the dropped rod. The NRC was notified via the ENS at [TIME]0101 hours[TIME] [DATE]February 18, 1987[DATE].  <br /><br />The root cause of this event was determined to be [CAUSE]a 

## Preprocessing

In [6]:
def escapeRegExp(text):
    """
    Replace all of the opening brackets with "\[" to escape the special characters in the regrex.
    This is required to prevent unterminated character sets.
    If the unterminated character sets error still occurs, try running the commented out lines as well.
    Source: https://stackoverflow.com/questions/54135606/python-re-error-unterminated-character-set-at-position
    """
    edited_text = text.replace("[", "\[")
    # edited_text = edited_text.replace("{", "\{")
    # edited_text = edited_text.replace("(", "\(")
    # edited_text = edited_text.replace(")", "\)")
    return edited_text

In [7]:
final_texts = map(escapeRegExp, texts) # where the "texts" variable contains the augmented texts

In [8]:
def fix_var_name(var_name: str):
    """
    Check string for violations of the Cypher variable name conventions,
    namely that it must always start with a letter.
    If the string starts with an integer, replace it with the word for this integer.
    For example, 2_october becomes two_october.
    Remove apostrophes
    The handleable_case parameter is set to True if the input is a handleable case. If not,
    handleable_case will be set to False, which will lead to the abortion of the current iteration
    in which fix_var_name() is being used.
    """
    handleable_case = True # The case is assumed to be handleable

    # if the case is unhandable, handleable_case will be set to False
    if len(var_name)==1: # for variable names with a length of 1 
        handleable_case = False
    # if len(var_name) > 30: # for variable names that are extremely long
    #     handleable_case = False 

    if len(var_name) > 0 and var_name[0].isdigit(): # this condition might cause the original error to occur again!!
        i_final_int = 0 # starts at zero to prevent index out of range error from occuring at var_name[i_final_int].isdigit() in the while statement
        
        while ( var_name[i_final_int].isdigit() ) and ( i_final_int < len(var_name) -1 ):
            i_final_int += 1
        
        if i_final_int > 0:
            var_name = num2words( var_name[ : i_final_int] ) + var_name[i_final_int : ]
            var_name = var_name.replace(" ", "_").replace("-", "_").lower() 
            
    return var_name, handleable_case

## Automate creation of Cypher CREATE query

In [9]:
def clean_str(entity_str: str):
    """
    Takes a string that represents an entity in the create_entity_query function
    and removes symbols that cause problems when creating a Neo4j query.

    Also look at the if statement ("if i_final_int > 0") within the fix_var_name() function.
    """
    entity_str = entity_str.replace("-", "_")

    entity_str = entity_str.replace(" ", "_").lower()
    
    entity_str = entity_str.replace("'", "")
    entity_str = entity_str.replace('"','')
    entity_str = entity_str.replace(".", "")
    entity_str = entity_str.replace(",", "")
    entity_str = entity_str.replace(";", "")
    entity_str = entity_str.replace("(", "")
    entity_str = entity_str.replace(")", "")
    entity_str = entity_str.replace("&", "")
    entity_str = entity_str.replace("\\", "")
    entity_str = entity_str.replace("[", "")
    entity_str = entity_str.replace("]", "")

    entity_str = entity_str.strip("_")

    return entity_str

In [10]:
def modify_str(entity_str: str):
    """
    Takes the extracted string in the create_entity_query function, and modifies it
    according to the naming conventions of neo4j.
    """
    variable_str = entity_str.replace(" ", "_").lower() 
    variable_str, handleable = fix_var_name(variable_str)
    entity_str = entity_str.lower().title()

    # entity_str = entity_str.replace('"','').replace("'","")

    # apply clean_str() function
    variable_str = clean_str(variable_str)
    entity_str = clean_str(entity_str)

    return variable_str, entity_str, handleable

In [11]:
def create_entity_query(augmented_texts: list, suppress_warnings: bool=True):
    """
    Creates a Cypher query that creates the entity part of a Neo4j Database 
    using the entities from the augmented_texts list.
    
    Inputs:
    - augmented_texts: list containing the annotated texts that have to be converted into a KG
    - suppress_warnings: If True, the warnings about texts that the function skips over will be suppressed.

    Assumptions: 
    - If entity tags are nested, this is never with a depth higher than 1. This means that the
        nested entities should always be of the shape "[TAG_1]entity1_pt1 [TAG_2]entity2[TAG2] entity1_pt1[TAG_1]"
    
    Output: 
    - Each query-line has the following format: (variable_name:ENTITY_TAG{name: 'name'})
    - In this format, each variable_name is completely decapitalized and all 
        whitespaces have been replaced by underscores.
    - the name that is within the curly brackets of each entity has each individual word capitalized, 
        but that is the only preprocessing performed on it.
    """
    # initialise a set to which all query-strings are added, and that can later be concatenated into a single string.
    # This helps preventing duplicates
    query_set = set()
    # Initialise a set that keeps track of the node labels. This prevents the following error from occuring:
    # """ {message: Variable `akanji` already declared (line 256, column 2 (offset: 10988)) """ 
    node_label_set = set()
    # initialise variable that prevents a closing tag from being investigated by skipping the next iteration of the current for loop if it equals True
    skip_iteration = False 
    # initialise variable that skips over an entire nested tag structure after it has been analyzed
    skip_nested_structure = 0 

    for text in augmented_texts:

        # Find all of the opening brackets in the text
        brackets_open = [m.start() for m in re.finditer("\\[", text)]
        # Find all of the closing brackets in the text
        brackets_close = [m.start() for m in re.finditer("]", text)]

        if (len(brackets_open) % 2 > 0) or (len(brackets_close) % 2 > 0):
            if not suppress_warnings:
                print(f"An uneven number of brackets has been found! Namely length of BO = {len(brackets_open)} or length of BC = {len(brackets_close)}")
            continue
            
        if len(brackets_open) != len(brackets_close):
            if not suppress_warnings:
                print(f"There is an unequal number of opening and closing brackets! There is {len(brackets_open)} opening brackets, and {len(brackets_close)} closing brackets.")
            continue

        # Loop over each tag to create query lines for the entities they annotate, while accounting for nested tags.
        for i in range( len(brackets_open) -1 ): # -1 is to prevent the function from finding a next pair of brackets when arriving at the final pair of brackets

            # skip this iteration if the previous non-nested (opening) tag has already been investigated
            if skip_iteration:
                skip_iteration = False
                continue
            
            # skip this iteration if the current tag is part of an already investigated nested structure
            if skip_nested_structure:
                skip_nested_structure -= 1
                continue

            io = brackets_open[i] # index of opening bracket of the entity tag that is currently being looped over
            ic = brackets_close[i] # index of closing bracket of the entity tag that is currently being looped over
            opening_tag = text[io+1:ic] # find the opening tag

            # initialise the variables that are required when working with a nested tag
            io_next = -1 # index of the opening bracket of the tag that is under investigation in the while statement
            ic_next = -1 # index of the closing bracket of the tag that is under investigation in the while statement
            ie_tags = [] # initialise a list where the indices of the brackets of the inner entities' tags can be stored
                         # ie_tags[0][0] = opening bracket index of ie opening tag; 
                         # ie_tags[1][1] = closing bracket index of ie closing tag
            closing_tag = None # initialise the closing tag that we are trying to find
            next_tag = None # initialise the variable that finds the closing tag 
            # initialise variable that keeps track if the tag that is currently under investigation is a nested one
            is_nested = -1 # current tag is nested if is_nested > 0
            j = i+1 # initialise index to use in the while statement 
            
            while ( next_tag != opening_tag ) and ( j < len(brackets_open) - 1 ) :
                closing_tag = next_tag # Set the closing tag to the tag that was investigated in the previous loop of the while statement
                io_next = brackets_open[j] 
                ic_next = brackets_close[j]
                next_tag = text[io_next+1:ic_next]

                # Remember the indices of the brackets of the inner entities' tags
                ie_tags.append( [io_next, ic_next] ) 

                is_nested += 1
                j += 1

            if is_nested > 2 :
                if not suppress_warnings:
                    print(f"A tag has been encountered that is nested on more than a single level! It is located in between indices {io_next} and {ic_next}.")
                continue

            if is_nested > 0: # if the current tag is a nested tag
                # Explanation of used terminology: the inner entity (ie) is the entity that is nested within the outer entity (oe)

                # find the indices of the inner entity (ie)
                begin_ie = ie_tags[0][1] + 1 # index of the first letter of the inner entity
                end_ie = ie_tags[1][0] - 1 # index of the last letter of the inner entity
                
                # extract the inner entity (ie)
                inner_entity = text[ begin_ie: end_ie ].strip()

                ie_variable, inner_entity, handleable_ie = modify_str(inner_entity)

                # if the unhandable case happens where an entity has length 1, continue
                # if not inner_entity:
                #     continue

                # find the indices of the outer entity (oe)
                begin_oe = io + (len(opening_tag) + 2) # index of the first letter of the outer entity
                end_oe = brackets_open[i+3] - 2 # index of the last letter of the outer entity
                ie_ot = brackets_open[i+1] - 1 # index of the ie opening tag (=ie_ot)
                ie_ct = brackets_close[i+2] + 1 # index of the ie closing tag (=ie_ct)
                # extract the outer entity (oe)
                outer_entity = text[ begin_oe : ie_ot ] + inner_entity + text[ ie_ct : end_oe + 1 ].strip() 
                
                oe_variable, outer_entity, handleable_oe = modify_str(outer_entity)

                # if the unhandable case happens where an entity has length 1, continue
                if (not handleable_ie) or (not handleable_oe):
                    continue
                
                # Define the variable names for the node labels
                oe_node_name = unidecode(oe_variable)
                oe_node_name = oe_node_name.replace("-","_")
                ie_node_name = unidecode(ie_variable)
                ie_node_name = ie_node_name.replace("-","_")
                
                # check if these variable names are already contained in the set of variable names
                if (oe_node_name in node_label_set) or (ie_node_name in node_label_set):
                    continue

                # Create a Cypher-queriable line that can be used to add this entity to the KG
                query_line_oe = f"({oe_node_name}:{opening_tag}{{name: '{unidecode(outer_entity)}'}})" 
                query_set.add(query_line_oe) # add the query line to the set of all query lines that are going to be added to the KG
                node_label_set.add(oe_node_name) # add the variable name for the node label to the set of all node labels to prevent duplicates
                query_line_ie = f"({ie_node_name}:{closing_tag}{{name: '{unidecode(inner_entity)}'}})" 
                query_set.add(query_line_ie) # add the query line to the set of all query lines that are going to be added to the KG
                node_label_set.add(ie_node_name) # add the variable name for the node label to the set of all node labels to prevent duplicates

                # Make sure the current for loop skips over the remaining 3 tags from the current nested structure
                skip_nested_structure = 3 

            else: # if the current tag is not a nested one

                # extracting the name of the entity 
                begin_entity = io + (len(opening_tag) + 2) # index of the first letter of the entity
                end_entity = brackets_close[i+1] - (len(opening_tag) + 2) # index of the last letter of the entity
                
                entity = text[ begin_entity : end_entity  ].strip()

                entity_variable, entity, handleable = modify_str(entity)

                # if the unhandable case happens where an entity has length 1, continue
                if not handleable:
                    continue

                # Define the variable name for the node label
                node_name = unidecode(entity_variable)
                node_name = node_name.replace("-","_")

                # check if this variable name is already contained in the set of variable names
                if node_name in node_label_set:
                    continue

                # Create a Cypher-queriable line that can be used to add this entity to the KG
                query_line = f"({node_name}:{opening_tag}{{name: '{unidecode(entity)}'}})" 
                query_set.add(query_line) # add the query line to the set of all query lines that are going to be added to the KG
                node_label_set.add(node_name) # add the variable name for the node label to the set of all node labels to prevent duplicates

                # skip the next iteration of the current for loop to prevent a closing tag from being investigated
                skip_iteration = True

    return query_set

In [12]:
# Running create_entity_query on the first 10 texts of the dataset
query_set_NER = create_entity_query(final_texts, suppress_warnings=True)

In [13]:
# Show results
print('length of the set = ', len(query_set_NER))
print(query_set_NER)

length of the set =  34
{"(and_reactor:EMPLOYEE{name: 'and_reactor'})", "(mode_2:LOCATION{name: 'mode_2'})", "(three/22/88:DATE{name: '3/22/88'})", "(two/18/87:DATE{name: '2/18/87'})", "(personnel_error:CAUSE{name: 'personnel_error'})", "(the_turbine_operator_at_the_turbine_pedestal_used_an_excessive_amount_of_steam_to_increase_turbine_speed:UNEXPECTED EVENT{name: 'the_turbine_operator_at_the_turbine_pedestal_used_an_excessive_amount_of_steam_to_increase_turbine_speed'})", "(a_reactor_scram_on_high_main_coolant_pressure:UNEXPECTED EVENT{name: 'a_reactor_scram_on_high_main_coolant_pressure'})", "(which_delayed_timely_compensatory_actions_<br_/><br_/>_plant_response_was_normalfollowing_the_scram_corrective_actions_were_taken_to_reinstruct:EMPLOYEE{name: 'which_delayed_timely_compensatory_actions_<br_/><br_/>_plant_response_was_normalfollowing_the_scram_corrective_actions_were_taken_to_reinstruct'})", "(operators:EMPLOYEE{name: 'operators'})", "(one_hours:TIME{name: '0001_hours'})", "(may

## Create the Knowledge Graph in Neo4j

### Implement the RE part of the query -> looping over all 50 texts

In [14]:
# # Database Credentials

# uri = "bolt://localhost:7687" # Click the copy button in the "Bolt port" row from the table that appears when you click NBA_example in Neo4j Desktop
# userName = "neo4j"
# password = "bilalbroski1" # password for Text_Mining_Neo4j DBMS

In [15]:
# # Connect to the neo4j database server
# graphDB_Driver = GraphDatabase.driver(uri, auth=(userName, password))

In [16]:
# # variable that should be set to "True" when the CREATE query has to be run
# # The reason why this variable exists, is that if you run the code multiple times 
# # with create_DB set to True, there will be a lot of duplicates within Neo4j
# create_DB = True

In [17]:
def create_relation_query(relations_set: set):
    """
    Creates a Cypher query that creates the relation/RE part of a Neo4j Database 
    
    Assumptions: 
    - relations_set is of shape: set((SUBJECT, relationship, OBJECT), ... ), where SUBJECT and OBJECT are variables 
        referring to entities and are contained within a tuple.
    - THE SUBJECTS, RELATIONSHIPS, AND OBJECTS HAVE THE SAME SHAPE/FORMAT AS IN THE NER MODEL, so
        - all have been unidecoded
        - subjects and objects have been stripped of leading and trailing whitespaces, are 
            completely in lowercase, and all leftover whitespaces have been replaced by underscores
    """
    # initialise a set to which all query-strings are added, and that can later be concatenated into a single string.
    # This helps preventing duplicates
    query_set = set()

    for s in relations_set:
        
        subj = clean_str( s[0].replace("'", "") )
        relation = s[1]
        obj = clean_str( s[2].replace("'", "") )

        query_line = f"({ unidecode(subj) })-[:{ relation }]->({ unidecode(obj) })" 
        query_set.add(query_line) # add the query line to the set of all query lines that are going to be added to the KG

    return query_set

In [18]:
def combine_queries(query_set_NER: set, query_set_RE: set):
    """
    Concatenate all of the query lines containing the named entities together with the query lines
    containing the relationships, in order to form one final Cypher query that can be ran to 
    create the KG.
    """
    # Concatenate all of the query lines to form one final Cypher query that creates the KG
    cqlCreate = """CREATE"""

    # add the entities from the NER set
    for i, line1 in enumerate(query_set_NER):
        if not i: # the first query entry should not be seperated from the CREATE statement with a comma
            cqlCreate = cqlCreate + ' \n' + line1
        else:
            cqlCreate = cqlCreate + ',\n' + line1

    # add the relationships from the RE set
    for j, line2 in enumerate(query_set_RE):
        if j == (len(query_set_RE)-1): # The final entry of the query should end with a semicolon
            cqlCreate = cqlCreate + ',\n' + line2 + ';'
        else:
            cqlCreate = cqlCreate + ',\n' + line2

    return cqlCreate

In [19]:
# cqlEmpty = """match (n) detach delete n"""

In [20]:
# initialise set to add all outputs of the RE model to
RE_output_set = set()

RE_list = []

# Loop over all 50 texts
for text_id in range(5):

    RE_sublist = []

    # import the output from the RE model
    RE_data_path = f"C:/Users/guusj/Documents/AAA_Master_DSAI/Y2Q1/2AMM30_Text_Mining/RESIT/Data Ivan/test_output_C2/file_{text_id}"
    with open(RE_data_path, "rb") as fp:
        re_output = pickle.load(fp)
        
        print('re_output = ', re_output)

        # preprocess the variable names of the entities within data. 
        # The variable names of the relationships already exist in the correct form
        for i in range(len(re_output)):
            old_relation = re_output[i]

            # perform preprocessing
            new_relation_0 = old_relation[0].strip().replace(" ", "_").lower()
            new_relation_2 = old_relation[2].strip().replace(" ", "_").lower()

            # Make sure the imported variable names comply with the naming restrcitions of Cypher
            new_relation_0, handleable0 = fix_var_name(new_relation_0)
            new_relation_0 = unidecode(new_relation_0)
            new_relation_2, handleable2 = fix_var_name(new_relation_2)
            new_relation_2 = unidecode(new_relation_2)

            if (handleable0 and handleable2):
                # Combine all new entries into a tuple and use it to replace the old tuple
                new_relation = (new_relation_0, old_relation[1], new_relation_2)                
                re_output[i] = new_relation

                # convert the re_output list into a set
                # re_output = set(re_output)
                
                # add re_output to RE_output_set
                RE_output_set.add(new_relation)
                RE_sublist.append(new_relation)
                # print('RE_sublist = ',RE_sublist)
            else:
                print("KAAAAAAAAAAAAAAAAS ", i)

    # STOP THE LOOP FOR EACH TEXT, SUCH THAT YOU CAN SCREENSHOT THE ACCOMPANYING KNOWLEDGE GRAPH IN NEO4J
    
    print()
    print()
    query_set_RE = create_relation_query(RE_output_set)
    RE_list.append(RE_sublist)
    # print(f'query_set_RE text nr {text_id}: ', query_set_RE)
    # print("RE_list = ", RE_list)
    print()
    
    # cqlCreate = combine_queries(query_set_NER, query_set_RE)
    # print(f'cqlCreate text nr {text_id}' , cqlCreate)
    # print()

    # # Execute the CQL query to create the KG
    # if create_DB:
    #     with graphDB_Driver.session() as graphDB_Session:
    #         # Create nodes
    #         graphDB_Session.run(cqlCreate)

    # input("Press Enter to continue...")
    # print('continue after text_id nr ', text_id)
    # # Execute the CQL query to empty the KG
    # if create_DB:
    #     with graphDB_Driver.session() as graphDB_Session:
    #         # Create nodes
    #         graphDB_Session.run(cqlEmpty)

re_output =  [('high main coolant pressure condition', 'caused_by', 'a procedure inadequacy'), ('high main coolant pressure', 'caused_by', 'a procedure inadequacy')]



re_output =  [('a reactor scram on high main coolant pressure', 'caused_by', 'personnel error'), ('a reactor scram on high main coolant pressure', 'caused_by', 'Plant response was normal'), ('The turbine operator (at the turbine pedestal) used an excessive amount of steam to increase turbine speed', 'caused_by', 'Plant response was normal'), ('second factor was the', 'done_by', 'operator')]



re_output =  []



re_output =  [('The reactor scram', 'happened_during', '0529 hours')]



re_output =  [('May 1988,', 'happened_during', 'Mode 1')]





In [21]:
len(RE_list)

5

In [22]:
RE_list

[[('high_main_coolant_pressure_condition',
   'caused_by',
   'a_procedure_inadequacy'),
  ('high_main_coolant_pressure', 'caused_by', 'a_procedure_inadequacy')],
 [('a_reactor_scram_on_high_main_coolant_pressure',
   'caused_by',
   'personnel_error'),
  ('a_reactor_scram_on_high_main_coolant_pressure',
   'caused_by',
   'plant_response_was_normal'),
  ('the_turbine_operator_(at_the_turbine_pedestal)_used_an_excessive_amount_of_steam_to_increase_turbine_speed',
   'caused_by',
   'plant_response_was_normal'),
  ('second_factor_was_the', 'done_by', 'operator')],
 [],
 [('the_reactor_scram',
   'happened_during',
   'five_hundred_and_twenty_nine_hours')],
 [('may_1988,', 'happened_during', 'mode_1')]]

In [23]:
# if create_DB:
#     with graphDB_Driver.session() as graphDB_Session:
#         # Create nodes
#         graphDB_Session.run(cqlEmpty)

In [24]:
# RE_output_set

In [25]:
query_set_RE = create_relation_query(RE_output_set)
# Show results
query_set_RE

{'(a_reactor_scram_on_high_main_coolant_pressure)-[:caused_by]->(personnel_error)',
 '(a_reactor_scram_on_high_main_coolant_pressure)-[:caused_by]->(plant_response_was_normal)',
 '(high_main_coolant_pressure)-[:caused_by]->(a_procedure_inadequacy)',
 '(high_main_coolant_pressure_condition)-[:caused_by]->(a_procedure_inadequacy)',
 '(may_1988)-[:happened_during]->(mode_1)',
 '(second_factor_was_the)-[:done_by]->(operator)',
 '(the_reactor_scram)-[:happened_during]->(five_hundred_and_twenty_nine_hours)',
 '(the_turbine_operator_at_the_turbine_pedestal_used_an_excessive_amount_of_steam_to_increase_turbine_speed)-[:caused_by]->(plant_response_was_normal)'}

## Create the Final Query

In [26]:
# cqlCreate = combine_queries(query_set_NER, query_set_RE)
# # Show results
# cqlCreate

## Create the Knowledge Graph in Neo4j using Cypher

In [27]:
# # Database Credentials

# uri = "bolt://localhost:7687" # Click the copy button in the "Bolt port" row from the table that appears when you click NBA_example in Neo4j Desktop
# userName = "neo4j"
# password = "bilalbroski1" # password for Text_Mining_Neo4j DBMS

In [28]:
# # Connect to the neo4j database server
# graphDB_Driver = GraphDatabase.driver(uri, auth=(userName, password))

In [29]:
# # variable that should be set to "True" when the CREATE query has to be run
# # The reason why this variable exists, is that if you run the code multiple times 
# # with create_DB set to True, there will be a lot of duplicates within Neo4j
# create_DB = True

In [30]:
# # Create a few queries to test the Knowledge Graph with after it has been created

# # CQL (=Cypher Query Language) to query all players that played for the Dutch national team
# cqlNationalTeamQuery = """MATCH (player:PLAYER) -[:played_for] -> (country:COUNTRY) 
# WHERE country.name = "Netherlands"
# RETURN player.name
# """
# # CQL (=Cypher Query Language) to query all players that play as goalkeeper
# cqlGoalkeeperQuery = """MATCH (player:PLAYER) -[:plays_as] -> (position:POSITION) 
# WHERE position.name = "Goalkeeper"
# RETURN player.name
# """

In [31]:
# cqlCreate

In [32]:
# # Execute the CQL query to create the KG
# if create_DB:
#     with graphDB_Driver.session() as graphDB_Session:
#         # Create nodes
#         graphDB_Session.run(cqlCreate)

# # Show the query that we ran again
# cqlCreate

In [33]:
# # Execute all other CQL queries and print the results
# with graphDB_Driver.session() as graphDB_Session:
#     # Query the graph #1
#     dutch_players = graphDB_Session.run(cqlNationalTeamQuery)

#     print("Names of all football players that have played for the Dutch national team:")
#     for player in dutch_players:
#         print(player)

#     # Query the graph #2
#     goalkeepers = graphDB_Session.run(cqlGoalkeeperQuery)

#     print('\n')
#     print("Names of players that play as goalkeepers:")
#     for player in goalkeepers:
#         print(player)

# Appendix with helpful explanations / code:

### List of helpful Cypher commands:
- To show the complete KG: 
    - MATCH (n) RETURN n
- To delete the complete KG:
    - MATCH (n) DETACH DELETE n

For an example of what a Python query to create a KG in Cypher/Neo4j should look like: https://github.com/harblaith7/Neo4j-Crash-Course/blob/main/01-initial-data.cypher

### Find index:

If you want to know which index a certain text has within the dataset,you can enter the name that starts the text as a string to the begin_str variable.


The following code will then print the index of the text that begins like this when you run it:

In [34]:
# # Define the number of players/texts
# nr_texts = 1031 # set to max number of texts (1031) since you want to search all of these

# # begin_str = "Thomas Alun Lockyer"
# begin_str = "Mark Maria Hubertus Flekken"

# for i in range(nr_texts):
#     with open(f'./data/footballer_{i}.txt', 'r') as f:
#         contents = f.read()
#         # texts.append(contents)
#         if (contents[:len(begin_str)] == begin_str):
#             print(f"The index of the text that starts with '{begin_str}' is: ", i)
#     f.close()