# Using the NER and Relation Extraction models to create Knowledge Graphs 

In the final step of the pipeline, knowledge graphs are created using the neo4j package in Python. They serve as an exploration of the data and give an idea of the kinds of insight that can be extracted from the pipeline.

The Google colab notebook can be accessed here: 
https://colab.research.google.com/drive/1PJDpQqAjoRHcTfZIJRnMuf-Q4adBmjwW?usp=sharing

#### Neo4j: Native Graph Database 
Neo4j is a native graph database for building and querying deeply connected knowledge graphs. It connects to Python through a 'Bolt URL', a username (often just 'neo4j') and a unique password that is generated when you create a new blank project (which Neo4j calls a 'sandbox').

The output of our NER and Relation Extraction models is fed to the graph as a CSV file, whose column names can be used to define nodes and relationships. Neo4j uses the 'Cypher' query language to analyse and query the data. A guide to its syntax can be found here: https://neo4j.com/docs/cypher-refcard/current/
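
As a rough illustration of how CSV columns can map onto nodes and relationships, the sketch below reads a small, made-up sample in the same column layout as `relations_df.csv` and reshapes each row into the parameters of a Cypher `UNWIND` query. The `Entity` label, the `RELATES_TO` relationship and the helper `csv_to_rows` are illustrative assumptions, not the notebook's own schema.

```python
import csv
import io

# Hypothetical two-row sample in the same column layout as relations_df.csv.
SAMPLE_CSV = """entity 1,entity 1 label,entity 2,entity 2 label,type
psilocybin,DRUG,anxiety,HEALTH,POSITIVELY IMPACTS
psilocybin,DRUG,suicidality,OUTCOME,POSITIVELY IMPACTS
"""

# Cypher template: each row becomes two nodes joined by one relationship.
# The Entity label and RELATES_TO type are illustrative choices.
CYPHER = """
UNWIND $rows AS row
MERGE (h:Entity {name: row.head})
MERGE (t:Entity {name: row.tail})
MERGE (h)-[:RELATES_TO {type: row.type}]->(t)
"""

def csv_to_rows(csv_text):
    """Map CSV columns onto the keys used by the Cypher template."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        {"head": r["entity 1"], "tail": r["entity 2"], "type": r["type"]}
        for r in reader
    ]

rows = csv_to_rows(SAMPLE_CSV)
print(rows[0])
```

The resulting list could then be passed to the driver as the `rows` parameter of the query, so the data never has to be pasted into the Cypher text itself.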

## Setup

In [1]:
# -- Mount Google Drive --
from google.colab import drive
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [2]:
# -- Change working directory --
cd /content/gdrive/MyDrive

/content/gdrive/MyDrive


#### Google Colab installations

In [3]:
# The driver is published on PyPI as `neo4j`; `neo4j-driver` is the older, deprecated name
!pip install neo4j



In [3]:
# -- General dependencies -- 
import pandas as pd
import json
import csv
from tqdm import tqdm
import pickle

# -- Knowledge graph functions -- 
from neo4j import GraphDatabase
from neo4j import basic_auth

#### Load the Data 

In [4]:
# -- Import coreferenced texts -- 
psychedelic_articles = pd.read_csv('/content/gdrive/MyDrive/relations_df.csv')

In [5]:
# -- Inspect the dataframe --
psychedelic_articles[:3]

Unnamed: 0.1,Unnamed: 0,entity 1,entity 1 label,entity 2,entity 2 label,type,source
0,0,psilocybin,DRUG,psychological distress,OUTCOME,POSITIVELY IMPACTS,f624d5abadc6418f41ea425eb53506e811e19f239c7681...
1,1,psilocybin,DRUG,anxiety,HEALTH,POSITIVELY IMPACTS,f624d5abadc6418f41ea425eb53506e811e19f239c7681...
2,2,psilocybin,DRUG,suicidality,OUTCOME,POSITIVELY IMPACTS,f624d5abadc6418f41ea425eb53506e811e19f239c7681...


In [6]:
# -- Remove the 'Unnamed: 0' column --
psychedelic_articles = psychedelic_articles.drop(columns="Unnamed: 0")
len(psychedelic_articles)

3348

In [7]:
# -- Create a simple table of entity pairs with relation type between them -- 
relations_simplified = psychedelic_articles.drop(columns="source")

In [8]:
relations_simplified[:3]

Unnamed: 0,entity 1,entity 1 label,entity 2,entity 2 label,type
0,psilocybin,DRUG,psychological distress,OUTCOME,POSITIVELY IMPACTS
1,psilocybin,DRUG,anxiety,HEALTH,POSITIVELY IMPACTS
2,psilocybin,DRUG,suicidality,OUTCOME,POSITIVELY IMPACTS


In [9]:
# -- Remove duplicate Rows -- 
print(f"There are {len(relations_simplified)} pairs of relations found.\n")

relations_simplified = relations_simplified.drop_duplicates(['entity 1','entity 2', 'type'],keep= 'first')

print(f"After removing duplicates there are {len(relations_simplified)} pairs of relations found.")

There are 3348 pairs of relations found.

After removing duplicates there are 1995 pairs of relations found.
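
To make the deduplication rule concrete, here is a minimal plain-Python sketch of what `drop_duplicates(['entity 1', 'entity 2', 'type'], keep='first')` does: the first row for each (entity 1, entity 2, type) triple is kept and later repeats are discarded. The sample rows and the helper `drop_duplicate_relations` are hypothetical.

```python
# Hypothetical relation rows; the keys mirror the dataframe columns.
relations = [
    {"entity 1": "psilocybin", "entity 2": "anxiety", "type": "POSITIVELY IMPACTS"},
    {"entity 1": "psilocybin", "entity 2": "anxiety", "type": "POSITIVELY IMPACTS"},
    {"entity 1": "psilocybin", "entity 2": "anxiety", "type": "NEGATIVELY IMPACTS"},
]

def drop_duplicate_relations(rows):
    """Keep the first row seen for each (entity 1, entity 2, type) triple."""
    seen = set()
    unique = []
    for row in rows:
        key = (row["entity 1"], row["entity 2"], row["type"])
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

unique_relations = drop_duplicate_relations(relations)
print(len(relations), "->", len(unique_relations))
```

Because the label columns are not part of the key, two rows that differ only in their entity labels collapse into one.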


### Create Subset Dataframes

#### Risks

In [10]:
# Create a dataframe of 'NEGATIVELY IMPACTS'
neg_impacts = relations_simplified.loc[relations_simplified['type'] == 'NEGATIVELY IMPACTS']
print(len(neg_impacts))

47


In [11]:
# -- save negatively impacts df -- 
neg_impacts.to_csv('negatively_impacts.csv')

In [12]:
# Create a dataframe of 'POSITIVELY IMPACTS'
pos_impacts = relations_simplified.loc[relations_simplified['type'] == 'POSITIVELY IMPACTS']
len(pos_impacts)

1458

In [32]:
# -- save positively impacts df -- 
pos_impacts.to_csv('positively_impacts.csv')

#### Load the NER data

In [13]:
#  -- Import the pickled object -- 
with open('/content/gdrive/MyDrive/parsed_ents.pickle', 'rb') as f:
     parsed_ents= pickle.load(f)

In [14]:
# -- Check the types of the parsed entities -- 
print(f"The parsed entities are type: {type(parsed_ents)}.")
print(f"The elements of parsed_ents are type: {type(parsed_ents[0])}.")
print(f"The parsed entities are of length: {len(parsed_ents)}.") 

The parsed entities are type: <class 'list'>.
The elements of parsed_ents are type: <class 'dict'>.
The parsed entities are of length: 84.


In [None]:
# -- Inspect -- 
parsed_ents[1]

In [15]:
# -- load relationships --
with open('/content/gdrive/MyDrive/relations_pickled.pickle', 'rb') as f:
     predicted_rels= pickle.load(f)

In [16]:
print(f"The predicted relationships are type: {type(predicted_rels)}.")
print(f"The elements of predicted_rels are type: {type(predicted_rels[0])}.")
print(f"The predicted_rels are of length: {len(predicted_rels)}.") 

The predicted relationships are type: <class 'list'>.
The elements of predicted_rels are type: <class 'dict'>.
The predicted_rels are of length: 3348.


In [17]:
predicted_rels[1]

{'head': 'psilocybin',
 'headLabel': 'DRUG',
 'source': 'f624d5abadc6418f41ea425eb53506e811e19f239c768100097c78295dadf858',
 'tail': 'anxiety',
 'tailLabel': 'HEALTH',
 'type': 'POSITIVELY IMPACTS'}

In [17]:
# -- load subset entity --
with open('/content/gdrive/MyDrive/subset_ents.pickle', 'rb') as f:
     subset_ents= pickle.load(f)

In [18]:
# -- load subset relationships --
with open('/content/gdrive/MyDrive/subset_rels.pickle', 'rb') as f:
     subset_rels= pickle.load(f)

In [19]:
type(subset_ents[1])

dict

## Connecting to our Neo4j Sandbox

In the section below, we will use our data to draw knowledge graphs for particular entities and the relations between them.

First, we call our sandbox project and set up the driver, which will act as a liaison between Python here and our graph visualisations (which are stored in the sandbox on the neo4j browser). Next, we pass our data from the variables 'parsed_ents' and 'predicted_rels' to populate the graph with our entities and their relationships. 

_We insert our username ('neo4j') and the auto-generated password ('worksheet-choice-perforation') along with the uri ('bolt://3.235.132.60:7687') to connect._

In [18]:
# -- Call our Sandbox project -- 

host = "bolt://3.235.132.60:7687"
user = 'neo4j'
password = "worksheet-choice-perforation"
driver = GraphDatabase.driver(host,auth=(user, password))

cypher_query = '''
MATCH (n)
RETURN COUNT(n) AS count
LIMIT $limit
'''

with driver.session(database="neo4j") as session:
  results = session.read_transaction(
    lambda tx: tx.run(cypher_query,
                      limit=10).data())
  for record in results:
    print(record['count'])

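# Do not close the driver here: neo4j_query() below reuses it.
# Call driver.close() once all queries in the notebook have run.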

1
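
The connection details above are written directly into the cell. One possible alternative, sketched here under stated assumptions, is to read them from environment variables. The variable names `NEO4J_URI`, `NEO4J_USER` and `NEO4J_PASSWORD` and the helper `load_neo4j_settings` are illustrative, not something the Neo4j driver requires.

```python
import os

def load_neo4j_settings(env=None):
    """Collect connection settings from environment variables.

    The variable names NEO4J_URI, NEO4J_USER and NEO4J_PASSWORD are
    illustrative assumptions, not names the driver itself looks for.
    """
    env = os.environ if env is None else env
    password = env.get("NEO4J_PASSWORD")
    if not password:
        raise ValueError("NEO4J_PASSWORD is not set")
    return {
        "uri": env.get("NEO4J_URI", "bolt://localhost:7687"),
        "user": env.get("NEO4J_USER", "neo4j"),
        "password": password,
    }

# Example with an in-memory mapping standing in for the real environment.
settings = load_neo4j_settings({"NEO4J_PASSWORD": "example-password"})
print(settings["uri"], settings["user"])
```

The returned values could then be handed to `GraphDatabase.driver(settings["uri"], auth=(settings["user"], settings["password"]))` in place of the literals.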


In [19]:
def neo4j_query(query, params=None):
    with driver.session() as session:
        result = session.run(query, params)
        return pd.DataFrame([r.values() for r in result], columns=result.keys())

In [26]:
# -- Clear the current neo4j sandbox database (this deletes every node and relationship) --
neo4j_query("""
MATCH (n) DETACH DELETE n;
""")

In [27]:
#Create a first main node
neo4j_query("""
MERGE (d:Drug {name:"psilocybin"}) 
RETURN d
""")

Unnamed: 0,d
0,(name)


In [28]:
# -- Add the entities to the graph from our parsed ents code -- 
neo4j_query("""
MATCH (d:Drug)
UNWIND $data as row
MERGE (a:Article{id:row.text_sha256})
SET a.text = row.text
MERGE (d)-[:MENTIONED_IN]->(a)
WITH a, row.annotations as entities
UNWIND entities as entity
MERGE (e:Entity{id:entity.id})
ON CREATE SET e.name = entity.text,
              e.type = entity.label
MERGE (a)-[m:MENTIONS]->(e)
ON CREATE SET m.count = 1
ON MATCH SET m.count = m.count + 1
""", {'data': parsed_ents})
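
The query above reads `text_sha256`, `text` and `annotations` from each record, and `id`, `text` and `label` from each annotation. A small check like the following sketch can flag records that lack one of those fields before they are sent to the database. The helper name `find_invalid_records` and the sample records are hypothetical.

```python
# Fields the Cypher query reads from each record and each annotation.
REQUIRED_RECORD_KEYS = {"text_sha256", "text", "annotations"}
REQUIRED_ENTITY_KEYS = {"id", "text", "label"}

def find_invalid_records(records):
    """Return the indexes of records missing a field the query reads."""
    bad = []
    for i, record in enumerate(records):
        if not REQUIRED_RECORD_KEYS.issubset(record):
            bad.append(i)
            continue
        if any(not REQUIRED_ENTITY_KEYS.issubset(ent) for ent in record["annotations"]):
            bad.append(i)
    return bad

# Hypothetical records: the second one lacks "label" on its annotation.
sample = [
    {"text_sha256": "abc", "text": "Psilocybin reduced anxiety.",
     "annotations": [{"id": "e1", "text": "psilocybin", "label": "DRUG"}]},
    {"text_sha256": "def", "text": "LSD was studied.",
     "annotations": [{"id": "e2", "text": "LSD"}]},
]
print(find_invalid_records(sample))
```

A missing property in Cypher evaluates to null, and `MERGE` on a null property fails, so catching such records in Python first tends to give a clearer error.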

In [29]:
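# -- Retrieve the id and name of the DRUG entities --
# Entities were created with the label `Entity` and their NER label stored in `type`
res = neo4j_query("""
MATCH (e:Entity {type: 'DRUG'})
RETURN e.id as id, e.name as name
""")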

In [30]:
# -- Add relations -- 
neo4j_query("""
UNWIND $data as row
MATCH (source:Entity {id: row.head})
MATCH (target:Entity {id: row.tail})
MATCH (text:Article {id: row.source})
MERGE (source)-[:REL]->(r:Relation {type: row.type})-[:REL]->(target)
MERGE (text)-[:POSITIVELY_IMPACTS]->(r)
""", {'data': predicted_rels})
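
The query above links every relation to its article with the same `POSITIVELY_IMPACTS` relationship, whatever the value of `row.type`. Relationship types generally cannot be supplied as ordinary query parameters, so one possible alternative (a sketch, not the notebook's method) is to group the relations by type in Python and build one query per type. The helpers `to_relationship_type`, `group_by_type` and `build_query`, the direct entity-to-entity edge and the sample rows are all illustrative assumptions.

```python
import re
from collections import defaultdict

def to_relationship_type(label):
    """Turn a free-text relation label into an identifier-style name."""
    cleaned = re.sub(r"[^A-Za-z0-9]+", "_", label.strip()).strip("_").upper()
    if not cleaned:
        raise ValueError(f"empty relation label: {label!r}")
    return cleaned

def group_by_type(relations):
    """Bucket relation dicts by their normalised relationship type."""
    grouped = defaultdict(list)
    for rel in relations:
        grouped[to_relationship_type(rel["type"])].append(rel)
    return dict(grouped)

def build_query(rel_type):
    # rel_type contains only letters, digits and underscores at this point,
    # so it can be interpolated into the query text.
    return (
        "UNWIND $data AS row\n"
        "MATCH (source:Entity {id: row.head})\n"
        "MATCH (target:Entity {id: row.tail})\n"
        f"MERGE (source)-[:{rel_type}]->(target)"
    )

# Hypothetical relations in the same shape as predicted_rels.
sample = [
    {"head": "psilocybin", "tail": "anxiety", "type": "POSITIVELY IMPACTS"},
    {"head": "psilocybin", "tail": "nausea", "type": "NEGATIVELY IMPACTS"},
    {"head": "psilocybin", "tail": "depression", "type": "POSITIVELY IMPACTS"},
]
grouped = group_by_type(sample)
for rel_type, rows in grouped.items():
    print(rel_type, len(rows))
```

Each group could then be sent with something like `neo4j_query(build_query(rel_type), {'data': rows})`.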

## Data exploration

Please note, the main graphs are stored in the neo4j browser. You can access it up until 26th January and play around with the entities. You can also query information within your graph, for example to return all articles that mention a particular entity, as below, or to count how many times entities co-occur.

In [36]:
# -- Return articles which mention 'psilocybin' --
query = """
MATCH (e:Entity)<-[:MENTIONS]-(a:Article)
WHERE e.name = 'psilocybin'
RETURN a.text as result
LIMIT 5 // We set a small limit just to explore 
"""
res = neo4j_query(query)
res

Unnamed: 0,result
0,"In this study, we analyzed the fMRI data of we..."
1,The current open-label pilot study identified ...
2,"In summary, we hypothesize that the induction ..."
3,This is the first preregistered report on micr...
4,By exploring the effects of ayahuasca intake o...


In [39]:
# -- Find the 10 most commonly occurring entity pairs --
query = """
MATCH (e1:Entity)<-[:MENTIONS]-()-[:MENTIONS]->(e2:Entity)
WHERE id(e1) < id(e2)
RETURN e1.name as entity1, 
       e2.name as entity2, 
       count(*) as cooccurrence
ORDER BY cooccurrence
DESC LIMIT 10
"""

cooccurances = neo4j_query(query)
cooccurances

Unnamed: 0,entity1,entity2,cooccurrence
0,psilocybin,depression,48
1,psilocybin,anxiety,47
2,anxiety,depression,41
3,psilocybin,psychedelics,40
4,depression,psychedelics,34
5,psilocybin,LSD,34
6,psilocybin,psychedelic,33
7,anxiety,psychedelics,30
8,psychedelics,psychedelic,30
9,anxiety,LSD,27
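
For intuition, the co-occurrence query can be mirrored in plain Python: for every article, each unordered pair of distinct entities it mentions is counted once, which is the role `WHERE id(e1) < id(e2)` plays in the Cypher. This is a rough sketch on made-up data, with the hypothetical helper `count_cooccurrences`.

```python
from collections import Counter
from itertools import combinations

def count_cooccurrences(article_entities):
    """Count how many articles mention each unordered pair of entities."""
    counts = Counter()
    for entities in article_entities:
        # sorted(set(...)) removes repeats within an article and fixes the
        # order of each pair, so (a, b) and (b, a) are never counted twice.
        for pair in combinations(sorted(set(entities)), 2):
            counts[pair] += 1
    return counts

# Hypothetical entity mentions, one list per article.
sample = [
    ["psilocybin", "anxiety", "depression"],
    ["psilocybin", "depression"],
    ["LSD", "anxiety"],
]
cooccurrence_counts = count_cooccurrences(sample)
print(cooccurrence_counts.most_common(1))
```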


## Graph of Risks 

In [126]:
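# -- Function to create first nodes -- 
def create_first_nodes(entities, labels):
    # Adds the first-entity (Risk1) nodes to the Neo4j graph.
    # NOTE: `conn` is not defined in this notebook; it is assumed to be a
    # connection helper exposing query(query, parameters=...).
    # Combine each entity with its label so the query can set both.
    rows = pd.concat([entities, labels], axis=1).to_dict('records')
    query = '''
            UNWIND $rows AS row
            MERGE (r1:Risk1 {risk: row.ent1})
            SET r1.label = row.ent1Label
            RETURN count(*) as total
            '''
    return conn.query(query, parameters = {'rows': rows})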



In [127]:
# -- Extract the first entities from Dataframe --
ents1 = pd.DataFrame(neg_impacts[['ent1']])
len(ents1)

47

In [128]:
# -- Extract the first entities labels -- 
ent1Labels = pd.DataFrame(neg_impacts[['ent1Label']])
print(len(ent1Labels))

47


In [129]:
# -- Create the nodes -- 
create_first_nodes(ents1, ent1Labels)

[<Record total=47>]

In [130]:
# -- create entity 2 nodes -- 

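def create_second_nodes(entities, labels):
    # Adds the second-entity (Risk2) nodes to the Neo4j graph.
    # NOTE: `conn` is not defined in this notebook; it is assumed to be a
    # connection helper exposing query(query, parameters=...).
    # Combine each entity with its label so the query can set both.
    rows = pd.concat([entities, labels], axis=1).to_dict('records')
    query = '''
            UNWIND $rows AS row
            MERGE (r2:Risk2 {risk: row.ent2})
            SET r2.label = row.ent2Label
            RETURN count(*) as total
            '''
    return conn.query(query, parameters = {'rows': rows})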

In [131]:
# -- Extract second entities -- 
ents2 = pd.DataFrame(neg_impacts[['ent2']])

In [132]:
# -- Extract entity labels -- 
ent2Labels = pd.DataFrame(neg_impacts[['ent2Label']])

In [133]:
# -- Create the 2nd nodes -- 
create_second_nodes(ents2, ent2Labels)

[<Record total=47>]

In [145]:
# --- create an initial entity --- 
neo4j_query("""
MERGE (e:Entity {name:"psilocybin"}) 
RETURN e
""")

Unnamed: 0,e
0,(name)


In [140]:
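# Adding risk relations 
def add_relations(df):
   # NOTE: `conn` is not defined in this notebook; it is assumed to be a
   # connection helper exposing query(query, parameters=...).
   query = '''
   MATCH(e:Entity)
   UNWIND $data as row
   MERGE (r2:Risk2{risk:row.ent2})
   SET r2.text = row.ent2Label
   MERGE (e)-[r:RISKS]->(r2)
   RETURN e, r2
   '''
   # The driver cannot serialise a DataFrame, so pass a list of row dicts.
   return conn.query(query, parameters = {'data': df.to_dict('records')})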

In [139]:
add_relations(neg_impacts)

Query failed: {code: Neo.ClientError.Statement.ParameterMissing} {message: Expected parameter(s): data}


_Please head over to the neo4j browser to run queries, or check out the output folder on GitHub._