# Task 1: Define hand-crafted SPARQL queries (based on “horn-like” rules) for each relation p above to insert the missing relations \<s> \<p> \<o> into the given graph.


In [1]:
# !pip install rdflib ## uncomment this if you do not install rdflib
import rdflib
import copy # to copy graphs

g = rdflib.Graph()
g.parse("https://raw.githubusercontent.com/FHCampbell71/KG/main/Assignment_2/train.nt",format="nt") #load local rdf file

<Graph identifier=N2e3f0055c64140a6ae783bbf949c4dab (<class 'rdflib.graph.Graph'>)>

Then we can run SPARQL query over local rdf file

## Guess 1: missing hasCoauthor property based relations

If 2 authors are both creators of the same paper, then they have a coauthor relation.

coauthor(authorA, authorB) ⇐ creator(?somepaper, ?authorA) ∧ creator(?somepaper, ?authorB)


### query 1.1: count total coauthor relations

In [2]:
query = """
SELECT (count(*) as ?cnt) {
SELECT DISTINCT ?authorA ?authorB
WHERE {
?somepaper <http://purl.org/dc/terms/creator> ?authorA .
?somepaper <http://purl.org/dc/terms/creator> ?authorB .
}
} """ ## give the SPARQL query


In [3]:
qres = g.query(query) ## run SPARQL query over graph q
for row in qres: ## for each result
    print(f"{row}") ## print s p o from result, and it is 32511

(rdflib.term.Literal('32511', datatype=rdflib.term.URIRef('http://www.w3.org/2001/XMLSchema#integer')),)


### Query 1.2: count missing coauthor relations

In [4]:
# check the missing coauthor relations in this graph.
query = """SELECT (count(*) as ?cnt) {
SELECT DISTINCT ?authorA ?authorB
WHERE {
?somepaper <http://purl.org/dc/terms/creator> ?authorA .
?somepaper <http://purl.org/dc/terms/creator> ?authorB .
filter not exists {?authorA <http://lsdis.cs.uga.edu/projects/semdis/opus#coauthor> ?authorB }.
}}
 """ ## give the SPARQL query

qres = g.query(query) ## run SPARQL query over graph q
for row in qres: ## for each result
    print(f"{row.cnt} ") ## print cnt from result, and it is 8306. 

# paper 1 has authorA and authorB >> coauthor in the graph
# paper 2 has authorA, authorB and authorC >> but no coauthor in the graph
# so it is reasonable that missing coauthor relations are more than (32511-24822)

8306 


### Query 1.3: Insert the rule
As the rule is certain so it has 100% confidence, the query can be inserted into the graph directly.

In [5]:
# Insert the missing coauthor relations in this graph.
query = """
INSERT { ?authorA <http://lsdis.cs.uga.edu/projects/semdis/opus#coauthor> ?authorB }
WHERE {
?somepaper <http://purl.org/dc/terms/creator> ?authorA .
?somepaper <http://purl.org/dc/terms/creator> ?authorB .
filter not exists {?authorA <http://lsdis.cs.uga.edu/projects/semdis/opus#coauthor> ?authorB }.
}
""" ## give the SPARQL query

# Copy the original graph to a new graph
gnew = copy.deepcopy(g)

# Execute the SPARQL query
gnew.update(query)

# Sanity check
print("Original graph triples count:", len(g))
print("New graph triples count:", len(gnew))

Original graph triples count: 89253
New graph triples count: 97559


## Guess 2: missing memberOf property based relations

If one author is member of an affiliation, then his coauthors may also be in the same affiliation 

memberOf(?authorB, ?someAffiliation) ⇐ memberOf(?authorA, ?someAffiliation) ∧ coauthor(?authorA, ?authorB)


### Query 2.1 count support

In [6]:
# query for support
query = """
SELECT (count(*) as ?cnt) {
SELECT DISTINCT ?authorB ?someAffiliation
WHERE {
?authorB <http://www.w3.org/ns/org#memberOf> ?someAffiliation .
?authorA <http://lsdis.cs.uga.edu/projects/semdis/opus#coauthor> ?authorB .
?authorA <http://www.w3.org/ns/org#memberOf> ?someAffiliation .
}
} """ 

qres = g.query(query) ## run SPARQL query over graph q
for row in qres: ## for each result
    print(f"{row.cnt} ") ## print cnt from result, and it is 4236. 

4236 


### Query 2.2 count body

In [7]:
# query for body coverage

query = """
SELECT (count(*) as ?cnt) {
SELECT DISTINCT ?authorB ?someAffiliation
WHERE {
?authorA <http://www.w3.org/ns/org#memberOf> ?someAffiliation .
?authorA <http://lsdis.cs.uga.edu/projects/semdis/opus#coauthor> ?authorB .
?authorB <http://www.w3.org/ns/org#memberOf> ?someAffiliation_ . # use underscore here
} }
 """

qres = g.query(query) ## run SPARQL query over graph q
for row in qres: ## for each result
    print(f"{row.cnt} ") ## print cnt from result, and it is 10384. 

10384 


### Query 2.3 Insert rule
Here we give this rule a **medium** confidence score 4236/10384 = 40.79%.


In [8]:
# query for inserting rule with a confidence score 0.4079
query = """
INSERT { ?authorB <http://www.w3.org/ns/org#memberOf> ?someAffiliation .
  ?authorB <http://example.org/confidence> ?someAffiliation .
  ?authorB <http://example.org/confidence> "0.4079"^^xsd:double .} 
WHERE {
?authorA <http://www.w3.org/ns/org#memberOf> ?someAffiliation .
?authorA <http://lsdis.cs.uga.edu/projects/semdis/opus#coauthor> ?authorB .
filter not exists {?authorB <http://www.w3.org/ns/org#memberOf> ?someAffiliation . } .
}
""" 

# Copy the original graph to a new graph
gnew = copy.deepcopy(g)

# Execute the SPARQL query
gnew.update(query)

# Sanity check
print("Original graph triples count:", len(g))
print("New graph triples count:", len(gnew))

Original graph triples count: 89253
New graph triples count: 112962


## Guess 3: Missing hasDiscipline property based relations
 if two papers appear in the same conference, then they may have the same domain.
 
 hasDiscipline(?paperB, ?someDomain) ⇐ hasDiscipline(?paperA, ?someDomain) ∧ appearsInConferenceSeries(?paperA, ?someConference) ∧ appearsInConferenceSeries(?paperB, ?someConference)

### Query 3.1: count support

In [9]:
# query for counting support
query = """
SELECT (count(*) as ?cnt) {
SELECT DISTINCT ?paperB ?someDomain
WHERE {
?paperB <http://purl.org/spar/fabio/hasDiscipline> ?someDomain .
?paperA <http://purl.org/spar/fabio/hasDiscipline> ?someDomain .
?paperA <https://makg.org/property/appearsInConferenceSeries> ?someConference .
?paperB <https://makg.org/property/appearsInConferenceSeries> ?someConference .
}
} """ 

qres = g.query(query) ## run SPARQL query over graph q
for row in qres: ## for each result
    print(f"{row.cnt} ") ## print cnt from result, and it is 17776. 

15125 


### Query 3.2 count body

In [10]:
# query for counting body

query = """
SELECT (count(*) as ?cnt) {
SELECT DISTINCT ?paperB ?someDomain
WHERE {
?paperA <http://purl.org/spar/fabio/hasDiscipline> ?someDomain .
?paperA <https://makg.org/property/appearsInConferenceSeries> ?someConference .
?paperB <https://makg.org/property/appearsInConferenceSeries> ?someConference .
?paperB <http://purl.org/spar/fabio/hasDiscipline> ?someDomain_ . # with underscore
} }
 """

qres = g.query(query) ## run SPARQL query over graph q
for row in qres: ## for each result
    print(f"{row.cnt} ") ## print cnt from result, and it is 1337710. 


1337710 


### Query 3.3 Insert rule

Here I will give this rule a **low** confidence score 15125/1337710 = 1.13%

In [11]:
# query for inserting rule with a low confidence score 0.0113
query = """
INSERT { ?paperB <http://purl.org/spar/fabio/hasDiscipline> ?someDomain .
  ?authorB <http://example.org/confidence> ?someDomain .
  ?authorB <http://example.org/confidence> "0.0113"^^xsd:double .}
WHERE {
?paperA <http://purl.org/spar/fabio/hasDiscipline> ?someDomain .
?paperA <https://makg.org/property/appearsInConferenceSeries> ?someConference .
?paperB <https://makg.org/property/appearsInConferenceSeries> ?someConference .
filter not exists {?paperB <http://purl.org/spar/fabio/hasDiscipline> ?someDomain . }.
}
""" 

# Copy the original graph to a new graph
gnew = copy.deepcopy(g)

# Execute the SPARQL query
gnew.update(query)

# Sanity check
print("Original graph triples count:", len(g))
print("New graph triples count:", len(gnew))

Original graph triples count: 89253
New graph triples count: 1442275


## Guess 4: Missing appearsInConferenceSeries property based relations

If one paper cites another one, their authors are both the member of the same affiliation, then they may appear in the same conference series.

 appearsInConferenceSeries(?paperB, ?someConference) ⇐ cites(?paperA, ?paperB) ∧ creator(?paperA, ?authorA) ∧ creator(?paperB, ?authorB) ∧ memberOf(?authorA, someAffiliation) ∧ memberOf(?authorB, someAffiliation) ∧ appearsInConferenceSeries(?paperA, ?someConference)

### Query 4.1 count support

In [12]:
# query for counting support
query = """
SELECT (count(*) as ?cnt) {
SELECT DISTINCT ?paperB ?someConference
WHERE {
?paperA <http://purl.org/spar/cito/cites> ?paperB .
?paperA <http://purl.org/dc/terms/creator> ?authorA .
?paperB <http://purl.org/dc/terms/creator> ?authorB .
?authorA <http://www.w3.org/ns/org#memberOf> ?someAffiliation .
?authorB <http://www.w3.org/ns/org#memberOf> ?someAffiliation .
?paperA <https://makg.org/property/appearsInConferenceSeries> ?someConference .
?paperB <https://makg.org/property/appearsInConferenceSeries> ?someConference .
}
} """ 

qres = g.query(query) ## run SPARQL query over graph q
for row in qres: ## for each result
    print(f"{row.cnt} ") ## print cnt from result, and it is 390. 

390 


### Query 4.2 count body

In [13]:
# query for counting body
query = """
SELECT (count(*) as ?cnt) {
SELECT DISTINCT ?paperB ?someConference
WHERE {
?paperA <http://purl.org/spar/cito/cites> ?paperB .
?paperA <http://purl.org/dc/terms/creator> ?authorA .
?paperB <http://purl.org/dc/terms/creator> ?authorB .
?authorA <http://www.w3.org/ns/org#memberOf> ?someAffiliation .
?authorB <http://www.w3.org/ns/org#memberOf> ?someAffiliation .
?paperA <https://makg.org/property/appearsInConferenceSeries> ?someConference .
?paperB <https://makg.org/property/appearsInConferenceSeries> ?someConference_ . # use underscore
}
} """ 

qres = g.query(query) ## run SPARQL query over graph q
for row in qres: ## for each result
    print(f"{row.cnt} ") ## print cnt from result, and it is 926. 

926 


### Query 4.3 insert rule

Here I will give this rule a **high** confidence score 390/926 = 42.12%

In [14]:
# query for inserting rule with a high confidence score 0.7
query = """
INSERT { ?paperB <https://makg.org/property/appearsInConferenceSeries> ?someConference .
  ?paperB <http://example.org/confidence> ?someConference .
  ?paperB <http://example.org/confidence> "0.7"^^xsd:double .}
WHERE {
?paperA <http://purl.org/spar/cito/cites> ?paperB .
?paperA <http://purl.org/dc/terms/creator> ?authorA .
?paperB <http://purl.org/dc/terms/creator> ?authorB .
?authorA <http://www.w3.org/ns/org#memberOf> ?someAffiliation .
?authorB <http://www.w3.org/ns/org#memberOf> ?someAffiliation .
?paperA <https://makg.org/property/appearsInConferenceSeries> ?someConference .
filter not exists {?paperB <https://makg.org/property/appearsInConferenceSeries> ?someConference . }.
}
""" 

# Copy the original graph to a new graph
gnew = copy.deepcopy(g)

# Execute the SPARQL query
gnew.update(query)

# Sanity check
print("Original graph triples count:", len(g))
print("New graph triples count:", len(gnew))

Original graph triples count: 89253
New graph triples count: 91173


## Guess 5: missing cites property based relations

If two papers have the same domain, and appear in the same conference, then one paper may cite the other paper.

cites(?paperA, ?paperB) ⇐ appearsInConferenceSeries(?paperA, ?someConference) ∧ appearsInConferenceSeries(?paperB, ?someConference) ∧ hasDiscipline(?paperA, ?someDomain) ∧ hasDiscipline(?paperB, ?someDomain)

### Query 5.1 count support

In [15]:
# query for counting support
query = """
SELECT (count(*) as ?cnt) {
SELECT DISTINCT ?paperA ?paperB
WHERE {
?paperA <http://purl.org/spar/cito/cites> ?paperB .
?paperA <https://makg.org/property/appearsInConferenceSeries> ?someConference .
?paperB <https://makg.org/property/appearsInConferenceSeries> ?someConference .
?paperA <http://purl.org/spar/fabio/hasDiscipline> ?someDomain .
?paperB <http://purl.org/spar/fabio/hasDiscipline> ?someDomain .
}
} """ 

qres = g.query(query) ## run SPARQL query over graph q
for row in qres: ## for each result
    print(f"{row.cnt} ") ## print cnt from result, and it is 1102. 

1102 


### Query 5.2 count body

In [16]:
# query for counting body
query = """
SELECT (count(*) as ?cnt) {
SELECT DISTINCT ?paperA ?paperB
WHERE {
?paperA <https://makg.org/property/appearsInConferenceSeries> ?someConference .
?paperB <https://makg.org/property/appearsInConferenceSeries> ?someConference .
?paperA <http://purl.org/spar/fabio/hasDiscipline> ?someDomain .
?paperB <http://purl.org/spar/fabio/hasDiscipline> ?someDomain .
}
} """ 

qres = g.query(query) ## run SPARQL query over graph q
for row in qres: ## for each result
    print(f"{row.cnt} ") ## print cnt from result, and it is 204263. 

204263 


### Query 5.3 insert rule

Here I will give this rule a **low** confidence score 2%

In [17]:
# query for inserting rule with a low confidence score 0.02
query = """
INSERT { ?paperA <http://purl.org/spar/cito/cites> ?paperB .
  ?paperA <http://example.org/confidence> ?paperB .
  ?paperA <http://example.org/confidence> "0.02"^^xsd:double .}
WHERE {
?paperA <https://makg.org/property/appearsInConferenceSeries> ?someConference .
?paperB <https://makg.org/property/appearsInConferenceSeries> ?someConference .
?paperA <http://purl.org/spar/fabio/hasDiscipline> ?someDomain .
?paperB <http://purl.org/spar/fabio/hasDiscipline> ?someDomain .
filter not exists {?paperA <http://purl.org/spar/cito/cites> ?paperB .}.
}
""" 

# Copy the original graph to a new graph
gnew = copy.deepcopy(g)

# Execute the SPARQL query
gnew.update(query)

# Sanity check
print("Original graph triples count:", len(g))
print("New graph triples count:", len(gnew))

Original graph triples count: 89253
New graph triples count: 499762
