# Creating Knowledge Graphs using Pretrained Models 

This notebook takes the two models developed in steps 3 and 4 and runs them across the full body of texts. First, the named entity recognition model extracts the entities; the relation extractor then classifies the most likely relationship between each pair of entities. The extracted information is later fed into Neo4j to create the knowledge graphs. Because Neo4j does not connect well with servers, the knowledge graph code lives in the Colab notebook 'Creating_Knowledge_Graphs.ipynb'. The data extracted in this notebook is therefore stored as pickle objects and imported into that Colab notebook. All copies of the data can be inspected in the 4_Knowledge_Graphs/output folder.

The notebook is split into the following sections: 
1. Named Entity Recognition 
2. Relation Extraction 

## Setup 

In [68]:
# -- General dependencies -- 
import pandas as pd
import json
import csv
from tqdm import tqdm
import pickle

# -- Spacy functions -- 
import spacy
import hashlib

__Load in the Data__

In [2]:
# -- Import coreferenced texts -- 
psychedelic_articles = pd.read_csv('../KGs/1_coreference_resolution/data/coreferenced_full_texts.csv')

In [3]:
# -- inspect -- 
psychedelic_articles[:5]

Unnamed: 0,DOI,Discussion,Conclusions,Coreferenced_Discussion,Coreferenced_Conclusions
0,10.1177/0269881116675513,The present study demonstrated the efficacy of...,When administered under psychologically suppor...,The present study demonstrated the efficacy of...,When administered under psychologically suppor...
1,10.1001/archgenpsychiatry.2010.116,The initial goals of this research project wer...,,The initial goals of this research project wer...,
2,10.1124/pr.115.011478,"Dr. Albert Hofmann, the natural products chemi...",,"Dr. Albert Hofmann, the natural products chemi...",
3,10.1016/S2215-0366(16)30065-7,"In this open-label, single-arm pilot study, we...",,"In this open-label, single-arm pilot study, we...",
4,10.1073/pnas.1119598109,The fMRI studies reported here revealed signif...,,The fMRI studies reported here revealed signif...,


__Preprocess the Data__

In [4]:
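# --- Extract the discussion text (the column at position 3) ---  
discussions = []

for index, row in psychedelic_articles.iterrows():
    # .iloc makes the positional lookup explicit; str() turns missing values into 'nan'
    discussions.append(str(row.iloc[3]))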
    
print(f"The discussions have been extracted and are now a type: {type(discussions)}.")

The discussions have been extracted and are now a type: <class 'list'>.
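
Because every cell is wrapped in `str(...)`, a missing value does not stay missing: it becomes the literal string `'nan'`, which the NER model would then treat as text. The following is a minimal sketch on toy data of how such placeholders could be filtered out. The helper name `drop_missing` is hypothetical, and the source does not show whether the selected column actually contains missing values.

```python
import math


def drop_missing(raw_values):
    """Convert values to strings, skipping None and float NaN (hypothetical helper)."""
    cleaned = []
    for value in raw_values:
        if value is None:
            continue
        if isinstance(value, float) and math.isnan(value):
            continue
        cleaned.append(str(value))
    return cleaned


# Toy stand-in for a text column with one missing entry (an assumption, not the real data)
toy_column = ["The present study demonstrated...", float("nan"), "The initial goals..."]

# str() on a missing float yields the placeholder string that would reach the model
placeholder = str(float("nan"))
print(placeholder)
print(drop_missing(toy_column))
```

Filtering changes the number of documents, so any later length check (such as the expected 84) would need to account for that.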


In [5]:
discussions = discussions[:]

In [6]:
# -- inspect the first discussion -- 
discussions[:1]

['The present study demonstrated the efficacy of a high dose of psilocybin administered under supportive conditions to decrease symptoms of depressed mood and anxiety, and to increase quality of life in patients with a life-threatening cancer diagnosis. Eleven of 17 therapeutically relevant measures fulfilled conservative criteria for demonstrating efficacy of the high dose of psilocybin (Table 4, Figure 3). The data show that psilocybin produced large and significant decreases in clinician-rated and self-rated measures of depression, anxiety or mood disturbance, and increases in measures of quality of life, life meaning, death acceptance, and optimism. These effects were sustained at 6 months. For the clinician-rated measures of depression and anxiety, respectively, the overall rate of clinical response at 6 months was 78% and 83% and the overall rate of symptom remission was 65% and 57%. Participants attributed to the high-dose experience positive changes in attitudes about life, sel

## Step 1:  Named Entity Recognition Model 

In [7]:
# -- Load our fine-tuned named entity model --
nlp = spacy.load('../KGs/model-best')

In [8]:
# -- Test on one sentence -- 
text = ['This sentence tests whether our NER model can capture psychedelic drugs such as psilocybin, ayahuasca, DMT, and LSD, and also our health conditions such as depression, anxiety, and PTSD. It also looks for outcomes such as positive outcome, reduces depressive symptoms, and perhaps negative outcomes like increased anxiety']

In [9]:
for doc in nlp.pipe(text, disable=["tagger", "parser"]):
    print([(ent.text, ent.label_) for ent in doc.ents])

[('psychedelic drugs', 'DRUG'), ('psilocybin', 'DRUG'), ('ayahuasca', 'DRUG'), ('DMT', 'DRUG'), ('LSD', 'DRUG'), ('depression', 'HEALTH'), ('anxiety', 'HEALTH'), ('PTSD', 'HEALTH'), ('positive outcome', 'OUTCOME'), ('reduces depressive symptoms', 'OUTCOME'), ('anxiety', 'HEALTH')]




_Here we see that the model captures our drugs, health conditions, and some outcomes, but it misses 'increased anxiety' as an outcome and tags only 'anxiety' as HEALTH. Errors like this are to be expected given that the model's accuracy is around 65%._ 

## Run NER across all discussion texts

__Extract entities with all their positional information__

In [10]:
# -- Function to extract entities positional information -- 
def extract_ents(documents,nlp):
    #create an empty list object 
    docs = list()
    
    #for each document (discussion) in the model's pipeline, extract the named entities (ignore tagging and parsing)
    for doc in nlp.pipe(documents, disable=["tagger", "parser"]):
        #save the information into a temporary dictionary 
        dictionary=dict.fromkeys(["text", "annotations"])
        #Make the text column into string variables 
        dictionary["text"]= str(doc)
        #hash the text with SHA-256 so that each document has a stable identifier 
        dictionary['text_sha256'] =  hashlib.sha256(dictionary["text"].encode('utf-8')).hexdigest()
        
        # Create an empty list for annotations from this document to be saved into 
        annotations=[]
        
        #for every entity in the document 
        for e in doc.ents:
            #hash the entity text with SHA-256 to create an id for the entity 
            ent_id = hashlib.sha256(str(e.text).encode('utf-8')).hexdigest()
            #store the character offsets, the label the model has given, and the entity text 
            ent = {"start":e.start_char,"end":e.end_char, "label":e.label_,"label_upper":e.label_.upper(),"text":e.text,"id":ent_id}
            #append these to the annotations list to have text, labels, and positional information 
            annotations.append(ent)
        
        #Add our annotations to the dictionary object we've created and return dictionary 
        dictionary["annotations"] = annotations
        docs.append(dictionary)
        
    return docs
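
The `id` field above is a SHA-256 digest of the entity text, so the same surface string gets the same id in every document it appears in, while any difference in spelling or capitalisation produces a different id. A minimal sketch of that property follows; the helper name `entity_id` is hypothetical and only mirrors the hashing call used in `extract_ents`.

```python
import hashlib


def entity_id(entity_text):
    """Return the SHA-256 hex digest of the entity text (hypothetical helper)."""
    return hashlib.sha256(entity_text.encode("utf-8")).hexdigest()


first = entity_id("psilocybin")
second = entity_id("psilocybin")
capitalised = entity_id("Psilocybin")

print(first == second)       # same text, same id
print(first == capitalised)  # different capitalisation, different id
print(len(first))            # a SHA-256 hex digest is 64 characters long
```

One consequence is that 'psilocybin' and 'Psilocybin' would become separate nodes unless the text is normalised (for example lower-cased) before hashing. Whether that matters here depends on how the downstream graph is built.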

In [12]:
# -- Run across all the discussions -- 
parsed_ents = extract_ents(discussions,nlp)

In [179]:
print(f"The parsed entities are type: {type(parsed_ents)}.")
print(f"The columns within parsed ents are type: {type(parsed_ents[0])}.")
print(f"The parsed entities are of length: {len(parsed_ents)}.") # this should be 84 for our 84 documents 

The parsed entities are type: <class 'list'>.
The columns within parsed ents are type: <class 'dict'>.
The parsed entities are of length: 84.


In [15]:
# --- inspect instance of output ---
parsed_ents[0]['annotations']

[{'start': 62,
  'end': 72,
  'label': 'DRUG',
  'label_upper': 'DRUG',
  'text': 'psilocybin',
  'id': '55b804e1155b33b0b6cb16e81ec7dd406e046cb731d4e0334accd777990d9c95'},
 {'start': 138,
  'end': 152,
  'label': 'HEALTH',
  'label_upper': 'HEALTH',
  'text': 'depressed mood',
  'id': 'df85a0d5d82bc34ffe98dd5ded6e904975b70b9b4612bbd3890265035e30a327'},
 {'start': 157,
  'end': 164,
  'label': 'HEALTH',
  'label_upper': 'HEALTH',
  'text': 'anxiety',
  'id': 'ab0231f72dd9f07809f05d962bfc823a935461a2f194cc1b3be51dae6670836e'},
 {'start': 173,
  'end': 197,
  'label': 'OUTCOME',
  'label_upper': 'OUTCOME',
  'text': 'increase quality of life',
  'id': 'c0a249d408bf13c1f9c899081488f6febf24fd05e54110e3275456d915899e14'},
 {'start': 234,
  'end': 250,
  'label': 'HEALTH',
  'label_upper': 'HEALTH',
  'text': 'cancer diagnosis',
  'id': '1a7804cb3ecf89ae73509bc6fcc2e5222ced57137bbc2b5a71c6232c614c056c'},
 {'start': 349,
  'end': 357,
  'label': 'OUTCOME',
  'label_upper': 'OUTCOME',
  'text'
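
The `start` and `end` values are character offsets into the document text, so slicing the text with them should give back the entity string. A minimal sketch, using the opening of the first discussion and its first annotation as shown above; the helper name `span_text` is hypothetical.

```python
# Opening of the first discussion, copied from the output shown earlier
text = ("The present study demonstrated the efficacy of a high dose of "
        "psilocybin administered under supportive conditions")

# First annotation from parsed_ents[0]['annotations'] (the id field is omitted for brevity)
annotation = {"start": 62, "end": 72, "label": "DRUG", "text": "psilocybin"}


def span_text(document_text, ann):
    """Slice the document with the annotation's character offsets (hypothetical helper)."""
    return document_text[ann["start"]:ann["end"]]


recovered = span_text(text, annotation)
print(recovered)
```

A check like this is a cheap way to confirm that the offsets still line up after any later text cleaning.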

In [191]:
parsed_ents[1]

{'text': "The initial goals of this research project were to establish feasibility and safety for a hallucinogen treatment model in patients with advanced-stage cancer and anxiety. Following discussion with federal and state regulatory agencies as well as hospital institutional review board and research committees, a modest 0.2-mg/kg psilocybin dose was chosen. Although not comparable to higher doses of hallucinogens administered in the past to severely ill patients, the dose used here was still believed capable of inducing an alteration of consciousness with potential therapeutic benefit while optimizing patient safety. Determining safe parameters with this novel treatment paradigm is critical to establishing a strong foundation for this field of study that would allow for future investigations. Consistent with previous research, we found no untoward cardiovascular sequelae in we subject population.19 Minor HR and BP elevations after psilocybin administration were evidence only of a m

#### Save the parsed ents 

In [173]:
# -- Save as pickled object for loading into colab -- 
with open('4_Knowledge_Graphs/output/parsed_ents.pickle', 'wb') as f:
    pickle.dump(parsed_ents, f)

In [174]:
#  -- Check the types remain the same when re-loaded 
with open('4_Knowledge_Graphs/output/parsed_ents.pickle', 'rb') as f:
     test= pickle.load(f)

In [180]:
print(f"The parsed entities are type: {type(test)}.")
print(f"The columns within parsed ents are type: {type(test[0])}.")
print(f"The parsed entities are of length: {len(test)}.") 

The parsed entities are type: <class 'list'>.
The columns within parsed ents are type: <class 'dict'>.
The parsed entities are of length: 84.


_All looks good!_

In [43]:
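# --- save the output as a csv file --- 
keys = parsed_ents[0].keys()  # every dictionary has the same keys

# the context manager closes the file; newline="" stops the csv module writing blank rows on Windows
with open("4_Knowledge_Graphs/output/parsed_entities.csv", "w", newline="", encoding="utf-8") as parsed_entities:
    dict_writer = csv.DictWriter(parsed_entities, keys)
    dict_writer.writeheader()
    dict_writer.writerows(parsed_ents)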

In [45]:
# -- inspect saved csv file -- 
inspection = pd.read_csv('4_Knowledge_Graphs/output/parsed_entities.csv')
inspection[:3]

Unnamed: 0,text,annotations,text_sha256
0,The present study demonstrated the efficacy of...,"[{'start': 62, 'end': 72, 'label': 'DRUG', 'la...",f624d5abadc6418f41ea425eb53506e811e19f239c7681...
1,The initial goals of this research project wer...,"[{'start': 90, 'end': 102, 'label': 'DRUG', 'l...",9a11e128760948c78f95d7f0d9f5369790b83f686c75ae...
2,"Dr. Albert Hofmann, the natural products chemi...","[{'start': 90, 'end': 93, 'label': 'DRUG', 'la...",ac9c65712e00a2d59a49e1f0b04580c9e8adeeae5bb32e...
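
Note that the CSV stores each `annotations` list as its string representation, so reading the file back gives a string rather than a list of dictionaries. A minimal sketch of the round trip on a toy row, using an in-memory buffer in place of the real file; parsing with `ast.literal_eval` is one option and not necessarily what the downstream notebook does.

```python
import ast
import csv
import io

# Toy row with the same shape as one entry of parsed_ents (an assumption for illustration)
rows = [{"text": "toy document", "annotations": [{"start": 0, "end": 3, "label": "DRUG"}]}]

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["text", "annotations"])
writer.writeheader()
writer.writerows(rows)

# Read it back: the annotations cell is now a plain string
buffer.seek(0)
loaded = list(csv.DictReader(buffer))
raw = loaded[0]["annotations"]
print(type(raw))

# literal_eval safely turns the string back into Python objects
parsed = ast.literal_eval(raw)
print(parsed[0]["label"])
```

This is the reason the pickle file is the more convenient format for reloading, while the CSV is mainly useful for inspection.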


__Simpler parsing: extract only each entity's text and label__

In [17]:
# --- simpler parsing --- 
article_word = []
article_label = []


for doc in nlp.pipe(discussions, disable=["tagger", "parser"]):
    for ent in doc.ents:
        text = ent.text
        article_word.append(text)
        
        label = ent.label_
        article_label.append(label)

In [20]:
print(f"There are {len(article_word)} entities extracted across the 84 documents.")
print(f"There are {len(article_label)} labels extracted for these entities.")

There are 9903 entities extracted across the 84 documents.
There are 9903 labels extracted for these entities.


In [33]:
# -- Pair each entity text with its label --
from itertools import zip_longest

def merge(list1, list2):
    # pair the items position by position, padding the shorter list with ''
    return list(zip_longest(list1, list2, fillvalue=''))

In [36]:
tuple_list = merge(article_word, article_label)
print(f"\ntuple_list is a type: {type(tuple_list)}.")
print(f"\ntuple_list is length: {len(tuple_list)}.")


tuple_list is a type: <class 'list'>.

tuple_list is length: 9903.


In [35]:
tuple_list[:3]

[('psilocybin', 'DRUG'), ('depressed mood', 'HEALTH'), ('anxiety', 'HEALTH')]
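
With the entities stored as `(text, label)` pairs, a quick tally per label shows how the extractions are distributed. A minimal sketch using `collections.Counter` on the three pairs shown above; the real `tuple_list` would be passed in the same way.

```python
from collections import Counter

# The first three pairs from tuple_list, as shown in the output above
toy_tuple_list = [("psilocybin", "DRUG"), ("depressed mood", "HEALTH"), ("anxiety", "HEALTH")]

# Count how often each label occurs
label_counts = Counter(label for _, label in toy_tuple_list)
print(label_counts.most_common())

# Count how often each entity string occurs
text_counts = Counter(text for text, _ in toy_tuple_list)
print(text_counts["psilocybin"])
```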

In [39]:
# -- save the (entity, label) pairs -- 
with open("4_Knowledge_Graphs/output/entity_labels.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerows(zip(article_word, article_label))

## Step 2: Connect the NER with Relation Extraction 

In [46]:
# -- import the necessary dependencies -- 
import random
import typer
from pathlib import Path
import spacy
from spacy.tokens import DocBin, Doc
from spacy.training.example import Example

In [47]:
# -- import the relation extractor functions from our model code in the rel_component folder -- 
from rel_component.scripts.rel_pipe import make_relation_extractor, score_relations

In [48]:
# -- import the necessary model functions from the rel_component scripts folder -- 
from rel_component.scripts.rel_model import create_relation_model, create_classification_layer, create_instances, create_tensors

In [49]:
# -- load the Relation extraction model --
nlp2 = spacy.load("../KGs/rel_component/training/model-best")

### Extract relationships 

In [53]:
# -- function to extract our relations -- 
def extract_relations(documents,nlp,nlp2):
    # create an empty list to save the extracted info into 
    predicted_rels = list()
    
    # run the NER model over every psychedelic article
    for doc in nlp.pipe(documents, disable=["tagger", "parser"]):
        #hash the document text with SHA-256 to get an id for the source document
        source_hash = hashlib.sha256(doc.text.encode('utf-8')).hexdigest()
        
        #pass the doc through each component of the relation extraction pipeline 
        for name, proc in nlp2.pipeline:
              doc = proc(doc)
        
        #doc._.rel maps a pair of entity start offsets to a dictionary of relation probabilities 
        for value, rel_dict in doc._.rel.items():
            #candidate first entity
            for e in doc.ents:
                #candidate second entity
                for b in doc.ents:
                    #keep the pair of entities whose start offsets match this key 
                    if e.start == value[0] and b.start == value[1]:
                        #pick the relation label with the highest probability
                        max_key = max(rel_dict, key=rel_dict.get)
                        
                        #hash each entity's text to get its id 
                        e_id = hashlib.sha256(str(e).encode('utf-8')).hexdigest()
                        b_id = hashlib.sha256(str(b).encode('utf-8')).hexdigest()
                        #only keep the relation if its probability is at least 0.9
                        if rel_dict[max_key] >=0.9 :
                            predicted_rels.append({'head': e_id, 'tail': b_id, 'type':max_key, 'source': source_hash})
    return predicted_rels
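
The selection logic inside `extract_relations` can be sketched on plain dictionaries, without spaCy. The sketch assumes, as the `e.start == value[0]` comparison suggests, that the relation scores are keyed by pairs of entity start offsets. It replaces the two nested loops over `doc.ents` with a lookup table from start offset to entity. The function name `select_relations` and the toy scores are hypothetical.

```python
def select_relations(rel_scores, ents_by_start, threshold=0.9):
    """Keep the top-scoring label of each entity pair if it reaches the threshold.

    rel_scores: {(head_start, tail_start): {label: probability}}
    ents_by_start: {start_offset: entity_text}
    """
    selected = []
    for (head_start, tail_start), label_probs in rel_scores.items():
        head = ents_by_start.get(head_start)
        tail = ents_by_start.get(tail_start)
        if head is None or tail is None:
            continue  # the key does not correspond to two known entities
        # label with the highest predicted probability
        best_label = max(label_probs, key=label_probs.get)
        if label_probs[best_label] >= threshold:
            selected.append((head, best_label, tail))
    return selected


# Toy data (made up for illustration)
toy_ents = {11: "psilocybin", 20: "anxiety"}
toy_scores = {
    (11, 20): {"POSITIVELY IMPACTS": 0.95, "NEGATIVELY IMPACTS": 0.02},
    (20, 11): {"POSITIVELY IMPACTS": 0.40, "THOUGHT TO CAUSE": 0.35},
}

result = select_relations(toy_scores, toy_ents)
print(result)
```

Only the first pair survives, because its best label reaches the 0.9 threshold. A lookup table avoids scanning every entity pair for every key, which may help with the long runtime, although most of the time is likely spent inside the model itself.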

_Beware: the code below takes ~2hrs to run_

In [54]:
psychedelic_relationships = extract_relations(discussions,nlp,nlp2)

In [55]:
# -- check how many relationships have been classified -- 
len(psychedelic_relationships)

3348

In [63]:
# -- inspect the relationships -- 
psychedelic_relationships[501:506] # remember that our nodes are identified by their SHA-256 hashes 

[{'head': '062e0d8b0394241db98b94a14a411158c5cb469f6ebb300dd7c0003c00253b75',
  'tail': 'd6ef29ea86f326d5337f7f2de9a90c84688dadedb339dcfdb0b6582e3965a04a',
  'type': 'THOUGHT TO CAUSE',
  'source': '6cc23b09fa92f4cb146e4331836d768828cde4c7dfb338f687823040ba941ab7'},
 {'head': '062e0d8b0394241db98b94a14a411158c5cb469f6ebb300dd7c0003c00253b75',
  'tail': '96e05a77d6248d9ef953df9234112c4128d2624cbcdb1682019783fda7c84cb0',
  'type': 'THOUGHT TO CAUSE',
  'source': '6cc23b09fa92f4cb146e4331836d768828cde4c7dfb338f687823040ba941ab7'},
 {'head': '062e0d8b0394241db98b94a14a411158c5cb469f6ebb300dd7c0003c00253b75',
  'tail': 'be2a9e82fe321a0ea50d7e8a92f148f272302f0a862bcdebba7d459d7322d7c8',
  'type': 'THOUGHT TO CAUSE',
  'source': '6cc23b09fa92f4cb146e4331836d768828cde4c7dfb338f687823040ba941ab7'},
 {'head': '55b804e1155b33b0b6cb16e81ec7dd406e046cb731d4e0334accd777990d9c95',
  'tail': 'd5c9f01aec1fa5708b3c9bdeee3d0468721d2b8aab55ee1b20013213a674520a',
  'type': 'POSITIVELY IMPACTS',
  'source':

#### Getting more meaningful output than simply entity ID 

In [134]:
# --- Function to extract relations with readable entity text and labels ---
def extract_relations_visualisation(documents,nlp,nlp2):
    predicted_rels = list()
    for doc in tqdm(nlp.pipe(documents, disable=["tagger", "parser"])):
        source_hash = hashlib.sha256(doc.text.encode('utf-8')).hexdigest()
        for name, proc in nlp2.pipeline:
              doc = proc(doc)

        for value, rel_dict in doc._.rel.items():
            for e in doc.ents:
                for b in doc.ents:
                    if e.start == value[0] and b.start == value[1]:
                        max_key = max(rel_dict, key=rel_dict.get)
                        e_id = hashlib.sha256(str(e).encode('utf-8')).hexdigest()
                        b_id = hashlib.sha256(str(b).encode('utf-8')).hexdigest()
                        if rel_dict[max_key] >=0.9 :
                            #print(f" entities: {e.text, b.text} --> predicted relation: {rel_dict}")
                            predicted_rels.append({'entity 1': e.text, 'entity 1 label': e.label_, 'entity 2': b.text, 'entity 2 label': b.label_, 'type':max_key, 'source': source_hash})
    return predicted_rels

In [135]:
# --- Apply to the full dataset to get list of dictionaries --- 
relations = extract_relations_visualisation(discussions, nlp, nlp2)

84it [1:26:56, 62.10s/it] 


In [143]:
# -- inspect relations -- 
print(f"There are {len(relations)} relations detected within the discussions.")
print(f"These are of type {type(relations)}, where each row is a {type(relations[0])}.")
print(f"An example of our dictionary is:\n {relations[:1]}")

There are 3348 relations detected within the discussions.
These are of type <class 'list'>, where each row is a <class 'dict'>.
An example of our dictionary is:
 [{'entity 1': 'psilocybin', 'entity 1 label': 'DRUG', 'entity 2': 'psychological distress', 'entity 2 label': 'OUTCOME', 'type': 'POSITIVELY IMPACTS', 'source': 'f624d5abadc6418f41ea425eb53506e811e19f239c768100097c78295dadf858'}]


In [150]:
# -- Convert our list of dictionaries into a dataframe 
relations_df = pd.DataFrame(relations)

In [156]:
relations_count = relations_df.groupby('type').count()
relations_count['entity 1']

type
NEEDS MORE RESEARCH BETWEEN       205
NEGATIVELY IMPACTS                 55
NO RELATIONSHIP FOUND BETWEEN      37
POSITIVELY IMPACTS               2620
THOUGHT TO CAUSE                  431
Name: entity 1, dtype: int64

_We can see that our model heavily favours the 'POSITIVELY IMPACTS' class (2,620 of 3,348 relations, about 78%) and almost never predicts 'NEGATIVELY IMPACTS' (55 relations). This is likely because the training data was imbalanced, with significantly more instances of 'positively impacts' than of the other classes, and also because of the contrasts between the classes. This makes us skeptical of our outcomes, but we'll proceed for now to see if we can still extract meaningful insights._ 
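
To put numbers on the imbalance, the counts from the table above can be turned into shares of the total. This is a minimal sketch in plain Python using the counts exactly as printed.

```python
# Counts per relation type, copied from the groupby output above
counts = {
    "NEEDS MORE RESEARCH BETWEEN": 205,
    "NEGATIVELY IMPACTS": 55,
    "NO RELATIONSHIP FOUND BETWEEN": 37,
    "POSITIVELY IMPACTS": 2620,
    "THOUGHT TO CAUSE": 431,
}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# Print the classes from most to least frequent
for label, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{label:<32}{share:6.1%}")
```

The counts sum to the 3,348 relations reported earlier, and 'POSITIVELY IMPACTS' alone accounts for roughly 78% of them.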

#### Save the relations information 

In [183]:
# --- Save as a pickle file for reloading ---
with open('4_Knowledge_Graphs/output/relations_pickled.pickle', 'wb') as f:
    pickle.dump(relations, f)

In [185]:
#  -- Check the types remain the same when re-loaded 
with open('4_Knowledge_Graphs/output/relations_pickled.pickle', 'rb') as f:
     rel_test= pickle.load(f)

In [189]:
print(f"The rel_test is len: {len(rel_test)}.")
print(f"The rel_test is type: {type(rel_test)}.")
print(f"The rel_test has variables of type: {type(rel_test[0])}.")

The rel_test is len: 3348.
The rel_test is type: <class 'list'>.
The rel_test has variables of type: <class 'dict'>.


In [159]:
# -- Save a more readable csv file also -- 
relations_df.to_csv('4_Knowledge_Graphs/output/relations_df.csv')
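
Before building a graph, it can help to collapse repeated relations into unique (entity 1, type, entity 2) triples with a count, so that each edge carries a weight. The following is a minimal sketch on toy dictionaries with the same keys as `relations`. The source values are shortened placeholders, and how 'Creating_Knowledge_Graphs.ipynb' actually consumes the data is not shown here, so this is an assumption about one possible preprocessing step.

```python
from collections import Counter

# Toy relations with the same keys as the dictionaries in `relations` (values made up)
toy_relations = [
    {"entity 1": "psilocybin", "entity 1 label": "DRUG", "entity 2": "psychological distress",
     "entity 2 label": "OUTCOME", "type": "POSITIVELY IMPACTS", "source": "doc-a"},
    {"entity 1": "psilocybin", "entity 1 label": "DRUG", "entity 2": "psychological distress",
     "entity 2 label": "OUTCOME", "type": "POSITIVELY IMPACTS", "source": "doc-b"},
    {"entity 1": "LSD", "entity 1 label": "DRUG", "entity 2": "anxiety",
     "entity 2 label": "HEALTH", "type": "THOUGHT TO CAUSE", "source": "doc-a"},
]

# Count how many times each (head, relation, tail) triple was extracted
edge_counts = Counter((r["entity 1"], r["type"], r["entity 2"]) for r in toy_relations)

for (head, rel_type, tail), weight in edge_counts.most_common():
    print(f"{head} -[{rel_type} x{weight}]-> {tail}")
```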

_Please move to the 'Creating_Knowledge_Graphs.ipynb' file to inspect the creation of the knowledge graphs_