# Knowledge Graph Data Extraction

## Description

In this notebook, we:

- connect to an existing Neo4j graph database,
- extract schema and relevant KG information,
- extract a set of instances to use as samples,
- save relevant files for later use.

## Workspace Setup

In [None]:
%pip install neo4j
%pip install python-levenshtein



In [None]:
# Load and mount the drive helper
from google.colab import drive

# This will prompt for authorization
drive.mount('/content/drive')

# Set the working directory
%cd '/content/drive/MyDrive/cypherGen/'

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
/content/drive/MyDrive/cypherGen


In [None]:
# Neo4j graph database credentials
# Either set them inline (uncomment the lines below) or load them from Colab secrets
#URI = 'neo4j+s://xxxxxxxx.databases.neo4j.io'
#USER = 'neo4j'
#PWD = 'your password here'

from google.colab import userdata
URI = userdata.get('URI')
PWD = userdata.get('PWD')

In [None]:
# Necessary imports
import neo4j
import pandas as pd
import random
import itertools

# Import the local modules
from utils.utilities import *
from utils.neo4j_conn import *
from utils.neo4j_schema import *
from utils.graph_utils import *

In [None]:
# Initialize the Neo4j connector and utilities modules
graph = Neo4jGraph(url=URI, username='neo4j', password=PWD)
gutils = Neo4jSchema(url=URI, username='neo4j', password=PWD)

In [None]:
# Check the graph connection
graph.query("MATCH (n) RETURN count(n)")

[{'count(n)': 38650}]

In [None]:
# Create a path variable for the data folder
data_path = '/content/drive/MyDrive/cypherGen/datas/'

# Set file names
schema_file = 'schema_file.json'  # schema as JSON object
formatted_schema_file = 'formatted_schema.txt' # schema as a string to be included with prompt
node_instances_file = 'node_instances_file.json' # set of node instances as JSON object
rels_instances_file = 'rels_instances_file.json' # set of relationship instances as JSON object

## Extract data from KG

In [None]:
# Build the string schema
schema = gutils.get_schema

# Save the string schema to a file
with open(data_path+formatted_schema_file, 'w') as f:
    f.write(schema)

In [None]:
# The string schema
print(schema)

Node properties are the following:
Article {abstract: STRING, article_id: INTEGER, comments: STRING, title: STRING},Keyword {name: STRING, key_id: STRING},Topic {cluster: INTEGER, description: STRING, label: STRING},Author {author_id: STRING, affiliation: STRING, first_name: STRING, last_name: STRING},DOI {name: STRING, doi_id: STRING},Categories {category_id: STRING, specifications: STRING},Report {report_id: STRING, report_no: STRING},UpdateDate {update_date: DATE},Journal {name: STRING, journal_id: STRING}
Relationship properties are the following:
PUBLISHED_IN {meta: STRING, pages: STRING, year: INTEGER}
The relationships are the following:
(:Article)-[:HAS_KEY]->(:Keyword),(:Article)-[:HAS_DOI]->(:DOI),(:Article)-[:HAS_CATEGORY]->(:Categories),(:Article)-[:WRITTEN_BY]->(:Author),(:Article)-[:UPDATED]->(:UpdateDate),(:Article)-[:PUBLISHED_IN]->(:Journal),(:Article)-[:HAS_REPORT]->(:Report),(:Keyword)-[:HAS_TOPIC]->(:Topic)
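
The string schema printed above can be rebuilt from a structured schema dictionary. The following is a minimal sketch, not the actual `Neo4jSchema` implementation: the function name `format_schema` is hypothetical, and it assumes that `rel_props` uses the same `{'property': ..., 'datatype': ...}` layout as `node_props`.

```python
def format_schema(structured):
    """Render a structured schema dict as a prompt-ready string (hypothetical helper)."""

    def props_line(name, props):
        # e.g. "Topic {cluster: INTEGER, label: STRING}"
        inner = ", ".join(f"{p['property']}: {p['datatype']}" for p in props)
        return f"{name} {{{inner}}}"

    node_part = ",".join(
        props_line(name, props) for name, props in structured["node_props"].items()
    )
    rel_part = ",".join(
        props_line(name, props) for name, props in structured["rel_props"].items()
    )
    triples = ",".join(
        f"(:{r['start']})-[:{r['type']}]->(:{r['end']})"
        for r in structured["relationships"]
    )
    return "\n".join(
        [
            "Node properties are the following:",
            node_part,
            "Relationship properties are the following:",
            rel_part,
            "The relationships are the following:",
            triples,
        ]
    )
```

Keeping the formatting separate from the extraction makes it easy to change the prompt layout without querying the database again.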


In [None]:
# Build the json schema
jschema = gutils.get_structured_schema
# Check the output
jschema.keys()

dict_keys(['node_props', 'rel_props', 'relationships'])

In [None]:
# Extract the list of nodes
nodes = get_nodes_list(jschema)
print(nodes)

['Article', 'Keyword', 'Topic', 'Author', 'DOI', 'Categories', 'Report', 'UpdateDate', 'Journal']


In [None]:
# Read the nodes with their properties and their datatypes
node_props_types = jschema['node_props']
# Check the output

print(f"The properties of the node Report are:\n{node_props_types['Report']}")

The properties of the node Report are:
[{'property': 'report_id', 'datatype': 'STRING'}, {'property': 'report_no', 'datatype': 'STRING'}]


In [None]:
# Extract the relationships
relationships = jschema['relationships']
print("The relationships in the graph are:\n")
relationships

The relationships in the graph are:



[{'start': 'Article', 'type': 'HAS_KEY', 'end': 'Keyword'},
 {'start': 'Article', 'type': 'HAS_DOI', 'end': 'DOI'},
 {'start': 'Article', 'type': 'HAS_CATEGORY', 'end': 'Categories'},
 {'start': 'Article', 'type': 'WRITTEN_BY', 'end': 'Author'},
 {'start': 'Article', 'type': 'UPDATED', 'end': 'UpdateDate'},
 {'start': 'Article', 'type': 'PUBLISHED_IN', 'end': 'Journal'},
 {'start': 'Article', 'type': 'HAS_REPORT', 'end': 'Report'},
 {'start': 'Keyword', 'type': 'HAS_TOPIC', 'end': 'Topic'}]

In [None]:
# Extract sample node instances from the graph
node_instances = gutils.extract_node_instances(nodes, # node labels to sample
                                               4)  # number of instances to extract
# The result is a list of sublists, one for each node label in the provided list
node_instances[2]

[{'Instance': {'Label': 'Topic',
   'properties': {'description': 'The study of how populations grow, decline, and evolve over time, with a focus on understanding the underlying mechanisms and patterns that govern these processes. Key concepts include discrete log problems, logarithmic barriers, intermediate and super-exponential growth, layer-by-layer growth, and population dynamics from a superpopulation viewpoint. Topics also include population genetics, selection, and the role of logarithms in various contexts such as gain, log-balanced, log-price, and log resolution. Additionally, there is interest in understanding the relationship',
    'cluster': 0,
    'label': 'Population Dynamics_0'}}},
 {'Instance': {'Label': 'Topic',
   'properties': {'description': 'Focusing on techniques and concepts related to transformations, solutions, and properties of linear equations and matrices, including Jordan normal form, eigenvalues, eigenvectors, diagonalization, and eigenformulations.',
    

In [None]:
# Extract relationship instances
rels_instances = gutils.extract_multiple_relationships_instances(relationships, # relationships to sample
                                                                 8)  # number of instances to extract for each relationship
# A list of sublists, with 8 entries for each relationship type
rels_instances[0][0]

{'Article_Start': {'article_id': 1006,
  'comments': '21 pages, AMS-LaTeX',
  'abstract': '  Using matrix inversion and determinant evaluation techniques we prove several\nsummation and transformation formulas for terminating, balanced,\nvery-well-poised, elliptic hypergeometric series.\n',
  'title': 'Summation and transformation formulas for elliptic hypergeometric series'},
 'HAS_KEY': {},
 'Keyword_End': {'key_id': '720452e14ca2e4e07b76fa5a9bc0b5f6',
  'name': 'summation'}}
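
One way to obtain records shaped like the one above is to generate a Cypher query per relationship triple. The sketch below only builds the query string; `build_relationship_query` and the returned column aliases are hypothetical, and the query actually used by `extract_multiple_relationships_instances` may differ.

```python
def build_relationship_query(rel, limit):
    """Build a Cypher query that samples up to `limit` instances of one triple.

    `rel` is a dict with 'start', 'type', and 'end' keys, as in the
    relationships list of the structured schema.
    """
    return (
        f"MATCH (a:`{rel['start']}`)-[r:`{rel['type']}`]->(b:`{rel['end']}`) "
        "RETURN properties(a) AS start_props, "
        "properties(r) AS rel_props, "
        "properties(b) AS end_props "
        f"LIMIT {int(limit)}"
    )
```

Backtick-quoting the labels keeps the query valid even when a label contains spaces or other special characters.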

In [None]:
# Convert neo4j.time values in the extracted instances so they can be saved to JSON files
nodes_instances_serialized = serialize_nodes_data(node_instances)
rels_instances_serialized = serialize_relationships_data(rels_instances)
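
Temporal values such as the `DATE` property on `UpdateDate` cannot be passed to `json.dump` directly. A minimal sketch of the kind of conversion the serialization helpers might perform is shown below; `to_json_safe` is a hypothetical name, and the sketch assumes temporal objects expose either `iso_format()` (as `neo4j.time` types do) or the standard-library `isoformat()`.

```python
import json
from datetime import date


def to_json_safe(value):
    """Recursively replace temporal objects with ISO 8601 strings."""
    if isinstance(value, dict):
        return {key: to_json_safe(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_json_safe(item) for item in value]
    if hasattr(value, "iso_format"):  # assumed: neo4j.time.Date / DateTime / Time
        return value.iso_format()
    if hasattr(value, "isoformat"):  # standard-library date / datetime
        return value.isoformat()
    return value


# Illustrative instance only; the date is made up and uses datetime.date
# so that the example runs without a database connection.
instance = {"Instance": {"Label": "UpdateDate",
                         "properties": {"update_date": date(2020, 1, 31)}}}
print(json.dumps(to_json_safe(instance)))
```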

In [None]:
# For a KG with a large schema, it is better to provide only a subschema, based on a node selection
# This will extract first neighbors and all the corresponding relationships
get_subgraph_schema(jschema, ['Topic'], # nodes to extract information for
                    2, # Levenshtein distance threshold between an actual node label and a provided label
                    True) # return the result formatted as a string

'Node properties are the following:\nTopic {cluster: INTEGER, description: STRING, label: STRING}\nRelationship properties are the following:\n\nThe relationships are the following:\n(:Keyword)-[:HAS_TOPIC]->(:Topic)'
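
The label matching behind this call can be illustrated with a small, dependency-free sketch. It is not the implementation of `get_subgraph_schema`: the function names are hypothetical, the distance is computed with a plain dynamic-programming routine instead of `python-levenshtein`, and the numeric argument is assumed to be a maximum allowed edit distance.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def subgraph_schema(structured, labels, max_distance):
    """Keep the nodes whose label is close to a requested label,
    plus every relationship that touches one of those nodes."""
    matched = [
        node for node in structured["node_props"]
        if any(levenshtein(node, label) <= max_distance for label in labels)
    ]
    rels = [
        r for r in structured["relationships"]
        if r["start"] in matched or r["end"] in matched
    ]
    return {
        "node_props": {node: structured["node_props"][node] for node in matched},
        "relationships": rels,
    }
```

Allowing a small edit distance means a misspelled label such as `'Topc'` would still resolve to `Topic`, which is useful when the labels come from user input or model output.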

In [None]:
# Find the datatypes present in the graph
dtypes = retrieve_datatypes(jschema)
dtypes

{'DATE', 'INTEGER', 'STRING'}
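
Collecting the datatypes amounts to a single pass over the property lists. Here is a minimal sketch under the assumption that `rel_props` has the same layout as `node_props`; `collect_datatypes` is a hypothetical stand-in for `retrieve_datatypes`.

```python
def collect_datatypes(structured):
    """Return the set of datatypes used by node and relationship properties."""
    return {
        prop["datatype"]
        for section in ("node_props", "rel_props")
        for props in structured.get(section, {}).values()
        for prop in props
    }
```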

In [None]:
# Save data to json files
write_json(jschema, data_path+schema_file)
write_json(nodes_instances_serialized, data_path+node_instances_file)
write_json(rels_instances_serialized, data_path+rels_instances_file)
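
For reference, a JSON writer along these lines, together with the matching reader for later notebooks, could look like the sketch below. The names `write_json_sketch` and `read_json_sketch` are hypothetical, and the actual `write_json` in `utils.utilities` may use different options.

```python
import json


def write_json_sketch(data, path):
    """Save a JSON-serializable object to `path`."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=2, ensure_ascii=False)


def read_json_sketch(path):
    """Load the object back from `path`."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)
```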