<a href="https://colab.research.google.com/github/SolanaO/Knowledge_Graphs_Assortment/blob/master/arXiv_KG/3_ArXiv_KG_Queries.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Queries on ArXiv Knowledge Graph


## Description

In this notebook, we query the knowledge graph (KG) built in notebook 2. No special runtime settings are required, since we only connect to an existing Neo4j instance.


## Colab Setup

In [None]:
# Load and mount the drive helper
from google.colab import drive

# This will prompt for authorization
drive.mount('/content/drive')

# Set the working directory
%cd '/content/drive/MyDrive/arxivKG/'

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
/content/drive/MyDrive/arxivKG


## Installs & Imports for KG Querying

In [None]:
!pip install neo4j

Collecting neo4j
  Downloading neo4j-5.16.0.tar.gz (197 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: neo4j
  Building wheel for neo4j (pyproject.toml) ... done
  Created wheel for neo4j: filename=neo4j-5.16.0-py3-none-any.whl size=273811 sha256=7631d10efbb3508f70417a3db640b65db5d76d2ea01f09693af1894f46b0a7a6
  Stored in directory: /root/.cache/pip/wheels/20/a0/f6/87a1ec9636c915

In [None]:
import pandas as pd
import numpy as np
import json
from datetime import datetime
import hashlib
from neo4j import time


## Establish Neo4j Connection

In [None]:
# Import the Neo4j connector module
from utils.neo4j_conn import *

In [None]:
# Create a Neo4j AuraDB free instance and collect the credentials
URI = 'neo4j+s://xxxxxxxx.databases.neo4j.io'
USER = 'neo4j'
PWD = 'your_password_here'

# Initialize the Neo4j connector
graph = Neo4jGraph(url=URI, username=USER, password=PWD)

In [None]:
# Check the connection
graph.query("MATCH (n) RETURN count(n)")

[{'count(n)': 38650}]
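
The `Neo4jGraph` class comes from the project's `utils.neo4j_conn` module, which is not shown in this notebook. As a rough idea of what such a wrapper might do, the sketch below assumes it opens a session on a driver from the official `neo4j` package and returns each record as a dictionary. The names `SimpleGraph` and `connect` are hypothetical and are not the project's actual code.

```python
class SimpleGraph:
    """Hypothetical minimal wrapper; not the project's actual Neo4jGraph."""

    def __init__(self, driver):
        # Any object exposing .session() works, e.g. a neo4j.Driver
        self._driver = driver

    def query(self, cypher, params=None):
        # Run the statement and return one dict per record
        with self._driver.session() as session:
            result = session.run(cypher, params or {})
            return [record.data() for record in result]


def connect(url, username, password):
    # Imported lazily so SimpleGraph can be exercised without the driver installed
    from neo4j import GraphDatabase

    return SimpleGraph(GraphDatabase.driver(url, auth=(username, password)))
```

With this shape, `connect(URI, USER, PWD).query("MATCH (n) RETURN count(n)")` would return a list of dictionaries like the one shown above.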

## Sample Queries

In [None]:
# Query to extract the node labels and their properties
node_properties_query = """
CALL apoc.meta.data()
YIELD label, other, elementType, type, property
WHERE NOT type = "RELATIONSHIP" AND elementType = "node"
WITH label AS nodeLabels, collect(property) AS properties
RETURN {labels: nodeLabels, properties: properties} AS output

"""
node_props = graph.query(node_properties_query)
node_props

[{'output': {'labels': 'Article',
   'properties': ['abstract', 'article_id', 'comments', 'title']}},
 {'output': {'labels': 'Keyword', 'properties': ['name', 'key_id']}},
 {'output': {'labels': 'Topic',
   'properties': ['cluster', 'description', 'label']}},
 {'output': {'labels': 'Author',
   'properties': ['author_id', 'affiliation', 'first_name', 'last_name']}},
 {'output': {'labels': 'DOI', 'properties': ['name', 'doi_id']}},
 {'output': {'labels': 'Categories',
   'properties': ['category_id', 'specifications']}},
 {'output': {'labels': 'Report', 'properties': ['report_id', 'report_no']}},
 {'output': {'labels': 'UpdateDate', 'properties': ['update_date']}},
 {'output': {'labels': 'Journal', 'properties': ['name', 'journal_id']}}]
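
The nested `output` dictionaries are awkward to look up by label. A minimal sketch of one way to flatten them into a label-to-properties mapping follows; `sample_node_props` copies three rows from the output above, and `schema_by_label` is a hypothetical helper name.

```python
# Three rows copied from the output above
sample_node_props = [
    {"output": {"labels": "Article",
                "properties": ["abstract", "article_id", "comments", "title"]}},
    {"output": {"labels": "Keyword", "properties": ["name", "key_id"]}},
    {"output": {"labels": "Journal", "properties": ["name", "journal_id"]}},
]


def schema_by_label(rows):
    # Map each node label to a sorted list of its property names
    return {row["output"]["labels"]: sorted(row["output"]["properties"])
            for row in rows}


schema = schema_by_label(sample_node_props)
print(schema["Keyword"])
```

The same helper could be applied directly to `node_props` returned by the query.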

In [None]:
# Query to extract relationships list
rel_query = """
CALL apoc.meta.data()
YIELD label, other, elementType, type, property
WHERE type = "RELATIONSHIP" AND elementType = "node"
RETURN {source: label, relationship: property, target: other} AS output
"""
rels = graph.query(rel_query)
rels

[{'output': {'relationship': 'HAS_KEY',
   'source': 'Article',
   'target': ['Keyword']}},
 {'output': {'relationship': 'HAS_DOI',
   'source': 'Article',
   'target': ['DOI']}},
 {'output': {'relationship': 'HAS_CATEGORY',
   'source': 'Article',
   'target': ['Categories']}},
 {'output': {'relationship': 'WRITTEN_BY',
   'source': 'Article',
   'target': ['Author']}},
 {'output': {'relationship': 'UPDATED',
   'source': 'Article',
   'target': ['UpdateDate']}},
 {'output': {'relationship': 'PUBLISHED_IN',
   'source': 'Article',
   'target': ['Journal']}},
 {'output': {'relationship': 'HAS_REPORT',
   'source': 'Article',
   'target': ['Report']}},
 {'output': {'relationship': 'HAS_TOPIC',
   'source': 'Keyword',
   'target': ['Topic']}}]
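
For a quick overview of the schema, each row can be rendered as a Cypher-style pattern. The sketch below uses three rows copied from the output above and assumes the relationship points from `source` to each `target`, which matches the directed patterns used in the queries later in this notebook. `as_patterns` is a hypothetical helper name.

```python
# Three rows copied from the output above
sample_rels = [
    {"output": {"relationship": "HAS_KEY", "source": "Article",
                "target": ["Keyword"]}},
    {"output": {"relationship": "WRITTEN_BY", "source": "Article",
                "target": ["Author"]}},
    {"output": {"relationship": "HAS_TOPIC", "source": "Keyword",
                "target": ["Topic"]}},
]


def as_patterns(rows):
    # Build one "(:Source)-[:REL]->(:Target)" string per target label
    patterns = []
    for row in rows:
        out = row["output"]
        for target in out["target"]:
            patterns.append(
                f"(:{out['source']})-[:{out['relationship']}]->(:{target})"
            )
    return patterns


patterns = as_patterns(sample_rels)
print("\n".join(patterns))
```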

In [None]:
# Find 5 articles that contain 'algebra' in both the title and the abstract
# Note: CONTAINS is case-sensitive

read_query = """
MATCH (a:Article)
WHERE a.abstract CONTAINS 'algebra' AND a.title CONTAINS 'algebra'
RETURN a.title AS Title, a.abstract AS Abstract
LIMIT 5
"""
graph.query(read_query)

[{'Title': 'The Gervais-Neveu-Felder equation for the Jordanian quasi-Hopf\n  U_{h;y}(sl(2)) algebra',
  'Abstract': '  Using a contraction procedure, we construct a twist operator that satisfies a\nshifted cocycle condition, and leads to the Jordanian quasi-Hopf U_{h;y}(sl(2))\nalgebra. The corresponding universal ${\\cal R}_{h}(y)$ matrix obeys a\nGervais-Neveu-Felder equation associated with the U_{h;y}(sl(2)) algebra. For a\nclass of representations, the dynamical Yang-Baxter equation may be expressed\nas a compatibility condition for the algebra of the Lax operators.\n'},
 {'Title': 'Twist deformations for generalized Heisenberg algebras',
  'Abstract': '  Multidimensional Heisenberg algebras, whose creation and annihilation\noperators are the N-dimensional vectors, can be injected into simple Lie\nalgebras g. It is demonstrated that the spectrum of their deformations can be\ninvestigated using chains of extended Jordanian twists applied to U(g). In the\ncase of U(sl(N)) (for N>5)
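
The titles and abstracts keep the hard line breaks of the original arXiv metadata (`\n` followed by indentation). A small sketch of one way to tidy them for display is given below; `normalize_whitespace` is a hypothetical helper, and the title is copied from the first result above.

```python
def normalize_whitespace(text):
    # Collapse runs of spaces and newlines into single spaces
    return " ".join(text.split())


title = ("The Gervais-Neveu-Felder equation for the Jordanian quasi-Hopf\n"
         "  U_{h;y}(sl(2)) algebra")
clean_title = normalize_whitespace(title)
print(clean_title)
```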

In [None]:
# Basic node retrieval
# Fetch 5 journals in the database

query = """
MATCH (j:Journal)
RETURN j.name LIMIT 5
"""

graph.query(query)

[{'j.name': 'Rev. Mat.Iberoamericana'},
 {'j.name': 'Finite Fields Appl'},
 {'j.name': 'J. Geom. Analysis'},
 {'j.name': 'Comm. Partial Differential Equations'},
 {'j.name': 'Proyecciones'}]

In [None]:
# Find the 5 most published authors (counting articles linked to a journal)

read_query="""
MATCH (a:Author)-[]-(p:Article)-[]-(j:Journal)
RETURN a.last_name AS LastName, a.first_name AS FirstName, count(p) AS Freq
ORDER BY Freq DESC
LIMIT 5
"""
graph.query(read_query)

[{'LastName': 'Schick', 'FirstName': 'Thomas', 'Freq': 27},
 {'LastName': 'Bartholdi', 'FirstName': 'Laurent', 'Freq': 23},
 {'LastName': 'Kotschick', 'FirstName': 'D.', 'Freq': 20},
 {'LastName': 'Chakrabarti', 'FirstName': 'A.', 'Freq': 18},
 {'LastName': 'Suciu', 'FirstName': 'Alexander I.', 'Freq': 18}]
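
Note that `count(p)` counts matched paths, not distinct articles, so an article connected to more than one `Journal` node would be counted once per journal. Whether that happens in this graph is not verified here; the sketch below only illustrates the difference on hypothetical rows. In Cypher, `count(DISTINCT p)` gives the distinct count.

```python
from collections import Counter

# Hypothetical (author, article_id, journal) rows, one per matched path
rows = [
    ("Schick", 1, "Topology"),
    ("Schick", 1, "Topology and its Applications"),  # same article, second journal
    ("Schick", 2, "Topology"),
    ("Suciu", 3, "Topology"),
]

# Equivalent of count(p): one count per matched path
row_counts = Counter(author for author, _, _ in rows)

# Equivalent of count(DISTINCT p): one count per (author, article) pair
distinct_pairs = {(author, article) for author, article, _ in rows}
distinct_counts = Counter(author for author, _ in distinct_pairs)

print(row_counts["Schick"], distinct_counts["Schick"])
```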

In [None]:
# Node retrieval with property filtering
# Fetch articles whose update date falls in a specific year

query = """
MATCH (a:Article)-[]-(ud:UpdateDate)
WHERE date(ud.update_date).year = 2007
RETURN a.title, ud.update_date
LIMIT 4
"""

graph.query(query)

[{'a.title': 'Reconstruction of Gray-scale Images',
  'ud.update_date': neo4j.time.Date(2007, 7, 2)},
 {'a.title': 'Finite-Dimensional Crystals B^{2,s} for Quantum Affine Algebras of type\n  D_{n}^{(1)}',
  'ud.update_date': neo4j.time.Date(2007, 10, 8)},
 {'a.title': 'On nonparametric maximum likelihood for a class of stochastic inverse\n  problems',
  'ud.update_date': neo4j.time.Date(2007, 10, 8)},
 {'a.title': 'On the strong consistency of asymptotic M-estimators',
  'ud.update_date': neo4j.time.Date(2007, 10, 8)}]
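
An alternative to extracting `.year` is a half-open date range, which generalizes to arbitrary windows. In Cypher this could be written as `date(ud.update_date) >= date('2007-01-01') AND date(ud.update_date) < date('2008-01-01')`. The sketch below shows the same logic in plain Python; `year_bounds` is a hypothetical helper, and the last sample date is made up to show the exclusion.

```python
from datetime import date


def year_bounds(year):
    # Half-open interval [Jan 1 of year, Jan 1 of next year)
    return date(year, 1, 1), date(year + 1, 1, 1)


# First two dates appear in the output above; the third is hypothetical
sample_dates = [date(2007, 7, 2), date(2007, 10, 8), date(2008, 1, 1)]

start, end = year_bounds(2007)
in_2007 = [d for d in sample_dates if start <= d < end]
print(in_2007)
```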

In [None]:
# Fetch 10 articles and their authors published in a specific journal

query = """
MATCH (j:Journal {name: "Commun.Math.Phys"})<-[:PUBLISHED_IN]-(a:Article)-[:WRITTEN_BY]-(au:Author)
RETURN a.title, COLLECT(au.last_name + ', ' + au.first_name) AS authors
LIMIT 10
"""
graph.query(query)


[{'a.title': 'Hyper-K{\\"a}hler Hierarchies and their twistor theory',
  'authors': ['Dunajski, Maciej', 'Mason, Lionel J.']},
 {'a.title': '$A_{\\infty}$-structures on an elliptic curve',
  'authors': ['Polishchuk, Alexander']},
 {'a.title': 'Superselection Theory for Subsystems',
  'authors': ['Conti, Roberto', 'Doplicher, Sergio', 'Roberts, John E.']},
 {'a.title': 'Geometrical Tools for Quantum Euclidean Spaces',
  'authors': ['Cerchiai, B. L.', 'Fiore, G.', 'Madore, J.']},
 {'a.title': 'Classification of Subsystems for Local Nets with Trivial Superselection\n  Structure',
  'authors': ['Conti, Roberto', 'Carpi, Sebastiano']},
 {'a.title': 'Notes for a Quantum Index Theorem',
  'authors': ['Longo, Roberto']},
 {'a.title': 'A New Cohomology Theory for Orbifold',
  'authors': ['Chen, Weimin', 'Ruan, Yongbin']},
 {'a.title': 'Log mirror symmetry and local mirror symmetry',
  'authors': ['Takahashi, Nobuyoshi']},
 {'a.title': 'Quantum Affine (Super)Algebras $U_q(A_{1}^{(1)})$ and $U_q(
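
In the query above, `a.title` acts as the grouping key because it is the only non-aggregated column, and `COLLECT` gathers the author strings for each title. The following sketch reproduces that grouping in plain Python on a few (title, last name, first name) rows taken from the output above; the row order is arbitrary.

```python
from collections import defaultdict

# One row per (article, author) pair, as the MATCH would produce before aggregation
rows = [
    ("Superselection Theory for Subsystems", "Conti", "Roberto"),
    ("Superselection Theory for Subsystems", "Doplicher", "Sergio"),
    ("Notes for a Quantum Index Theorem", "Longo", "Roberto"),
    ("Superselection Theory for Subsystems", "Roberts", "John E."),
]

# Group by title and collect "last, first" strings, like COLLECT(...)
authors_by_title = defaultdict(list)
for title, last_name, first_name in rows:
    authors_by_title[title].append(f"{last_name}, {first_name}")

for title, authors in authors_by_title.items():
    print(title, authors)
```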

In [None]:
# Fetch all authors who wrote a particular article

query = """
MATCH (a:Author)<-[:WRITTEN_BY]-(art:Article {article_id: 1008})
RETURN a.last_name, a.first_name
"""

graph.query(query)

[{'a.last_name': 'Dunajski', 'a.first_name': 'Maciej'},
 {'a.last_name': 'Mason', 'a.first_name': 'Lionel J.'}]

In [None]:
# Find the journals in which an author's articles were published

query = """
MATCH path = (a:Author {last_name: "Warnaar"})-[]-(p:Article)-[]-(j:Journal)
RETURN j.name
"""
graph.query(query)

[{'j.name': 'Constructive Approximation'},
 {'j.name': 'J.Statist.Phys'},
 {'j.name': 'Discrete Mathematics'},
 {'j.name': 'Commun. Math. Phys'},
 {'j.name': ''}]
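
Journal names in this graph are not normalized: the result above contains `'Commun. Math. Phys'` and an empty string, while an earlier query matched `"Commun.Math.Phys"`. One possible way to compare such names is to build a whitespace-insensitive key, as sketched below. `journal_key` is a hypothetical helper, and this rule would not merge abbreviations that differ in more than spacing or case.

```python
import re


def journal_key(name):
    # Drop all whitespace and any trailing period, then lowercase
    return re.sub(r"\s+", "", name).rstrip(".").lower()


names = ["Commun. Math. Phys", "Commun.Math.Phys", "J.Statist.Phys", ""]

# Skip empty names and keep one key per distinct journal spelling
keys = {journal_key(n) for n in names if n}
print(sorted(keys))
```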

In [None]:
# Relationships with property filtering
# Fetch articles written by a specific author and updated after a certain date

query = """
MATCH (a:Author {last_name: "Schick"})-[]-(art:Article)-[]-(ud:UpdateDate)
WHERE ud.update_date > "2000-01-01"
RETURN art.title, ud.update_date
"""
graph.query(query)

[{'art.title': "A K-Theoretic Proof of Boutet de Monvel's Index Theorem for Boundary\n  Value Problems",
  'ud.update_date': '2007-05-23'},
 {'art.title': 'Finite group extensions and the Baum-Connes conjecture',
  'ud.update_date': '2014-11-11'},
 {'art.title': 'On a conjecture of Atiyah', 'ud.update_date': '2015-06-26'},
 {'art.title': 'Integrality of L2-Betti numbers',
  'ud.update_date': '2018-11-28'},
 {'art.title': 'Manifolds with boundary and of bounded geometry',
  'ud.update_date': '2018-11-28'},
 {'art.title': 'Approximating L2-invariants, and the Atiyah conjecture',
  'ud.update_date': '2018-11-28'},
 {'art.title': 'The spectral measure of certain elements of the complex group ring of a\n  wreath product',
  'ud.update_date': '2018-11-28'},
 {'art.title': 'Approximating L^2-signatures by their compact analogues',
  'ud.update_date': '2018-11-28'},
 {'art.title': 'Approximating Spectral invariants of Harper operators on graphs II',
  'ud.update_date': '2018-11-28'},
 {'art.ti
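
The filter above compares `update_date` with the string `"2000-01-01"`. This works when the dates are stored as ISO 8601 strings (`YYYY-MM-DD`), because that format sorts lexicographically in the same order as the calendar. A small check of that property, using three dates from the output above plus one made-up earlier date:

```python
from datetime import date

# The last value is hypothetical, added to show that it is filtered out
stamps = ["2018-11-28", "2007-05-23", "2015-06-26", "1999-12-31"]

# Plain string comparison and sorting
after_2000 = sorted(s for s in stamps if s > "2000-01-01")

# Sorting by the parsed date gives the same order
same_order = after_2000 == sorted(after_2000, key=date.fromisoformat)
print(after_2000, same_order)
```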

In [None]:
# Multiple paths
# Find authors who have written articles for journals whose name contains a given string

query = """
MATCH (a:Author)-[]-(:Article)-[]-(j:Journal)
WHERE j.name CONTAINS "Comm"
RETURN DISTINCT a.last_name AS Name
LIMIT 10
"""
graph.query(query)

[{'Name': 'Gioev'},
 {'Name': 'Coriasco'},
 {'Name': 'Schrohe'},
 {'Name': 'Seiler'},
 {'Name': 'Barles'},
 {'Name': 'Ley'},
 {'Name': 'Mangoubi'},
 {'Name': 'Dunajski'},
 {'Name': 'Mason'},
 {'Name': 'Polishchuk'}]

In [None]:
# Combining Aggregations and Paths
# Find the two journals that have published the most articles:

query = """
    MATCH (j:Journal)-[]-(a:Article)
    WHERE j.name <> ''
    RETURN j.name AS Journal, COUNT(a) AS NumberArticles
    ORDER BY NumberArticles DESC
    LIMIT 2
    """
graph.query(query)

[{'Journal': 'Algebr. Geom. Topol', 'NumberArticles': 344},
 {'Journal': 'Geom. Topol', 'NumberArticles': 285}]

In [None]:
# Complex Aggregations with Filtering
# Find authors who have written more than 5 articles, at least one of
# which was published in a journal whose name contains "Topology":

query = """
MATCH (a:Author)<-[:WRITTEN_BY]-(art:Article)
WITH a, COUNT(art) AS ArticleCount
WHERE ArticleCount > 5
MATCH (a)<-[:WRITTEN_BY]-(:Article)-[:PUBLISHED_IN]->(j:Journal)
WHERE j.name CONTAINS 'Topology'
RETURN a.last_name AS LastName, a.first_name AS FirstName, ArticleCount, j.name AS Journal
"""
pd.DataFrame(graph.query(query))

|   | LastName | FirstName | ArticleCount | Journal |
|---|----------|-----------|--------------|---------|
| 0 | Christensen | J. Daniel | 7 | Topology |
| 1 | Christensen | J. Daniel | 7 | Topology |
| 2 | Suciu | Alexander I. | 18 | Topology |
| 3 | Suciu | Alexander I. | 18 | Topology and Appl |
| 4 | Kotschick | D. | 20 | Topology |
| 5 | Feehan | Paul M. N. | 7 | Topology and its Applications |
| 6 | Meyer | Ralf | 13 | Topology |
| 7 | Meyer | Ralf | 13 | Topology |
| 8 | Tsaban | Boaz | 17 | Topology and its Applications |
| 9 | Tsaban | Boaz | 17 | Topology and its Applications |
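
The result contains repeated rows because the second `MATCH` returns one row per matching article. In Cypher, `RETURN DISTINCT` would collapse them; the same can be done after the fact in Python. The sketch below uses four rows copied from the result above and removes duplicates while preserving order.

```python
# Four rows copied from the result above
rows = [
    ("Christensen", "J. Daniel", 7, "Topology"),
    ("Christensen", "J. Daniel", 7, "Topology"),
    ("Suciu", "Alexander I.", 18, "Topology"),
    ("Suciu", "Alexander I.", 18, "Topology and Appl"),
]

# dict keys are unique and keep insertion order, so this drops repeats
unique_rows = list(dict.fromkeys(rows))
print(unique_rows)
```

On a DataFrame, `drop_duplicates()` achieves the same effect.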
