# Day 2: Graph Analysis of the Citation Graph of COVID Scientific Literature

> Pestryakova, S., Vollmers, D., Sherif, M.A. et al. 
> CovidPubGraph: A FAIR Knowledge Graph of COVID-19 Publications. Sci Data 9, 389 (2022). 
> https://doi.org/10.1038/s41597-022-01298-2

## BY: _______________________

## Scope and purpose: Compute Subgraph Frequencies

This notebook assumes a basic understanding of the contents of the dataset.

The dataset is distributed in RDF format as multiple snapshots taken at different points in time.
**One snapshot** of the dataset has been loaded into a SPARQL endpoint.

The snapshot will be analyzed by extracting data via SPARQL queries and analyzing portions of it with libraries and tools like `networkx`.

## Imports and Utility Functions

In [None]:
ENDPOINT = 'http://192.225.39.123:8888/sparql'

In [None]:
import os
import glob
import string
import numpy as np
import pandas as pd
from pandas import json_normalize
from IPython.display import display  # used by run_query when do_print=True

import networkx as nx
import matplotlib.pyplot as plt
from SPARQLWrapper import SPARQLWrapper, JSON
from SPARQLWrapper.SPARQLExceptions import EndPointInternalError


pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

endpoint = SPARQLWrapper(ENDPOINT)
endpoint.setReturnFormat(JSON)
endpoint.setTimeout(1200)
endpoint.method = 'POST'


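def run_query(query, endpoint, as_dataframe=False, do_print=False):
    """Run a SPARQL query with the standard prefixes prepended.

    Returns the list of JSON bindings, or a pandas DataFrame if
    as_dataframe is True. Returns None when the result set is empty
    or the endpoint reports an internal error.
    """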
    
    PREFIX= """
    PREFIX bibo: <http://purl.org/ontology/bibo/> 
    PREFIX bibtex: <http://purl.org/net/nknouf/ns/bibtex#> 
    PREFIX cvdo: <https://covid-19ds.data.dice-research.org/ontology/> 
    PREFIX cvdr: <https://covid-19ds.data.dice-research.org/resource/> 
    PREFIX dbo: <https://dbpedia.org/ontology/> 
    PREFIX dcterms: <http://purl.org/dc/terms/> 
    PREFIX fabio: <http://purl.org/spar/fabio/> 
    PREFIX foaf: <http://xmlns.com/foaf/0.1/> 
    PREFIX its: <http://www.w3.org/2005/11/its/rdf#> 
    PREFIX nif: <http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#> 
    PREFIX owl: <http://www.w3.org/2002/07/owl#> 
    PREFIX prov: <http://www.w3.org/ns/prov#> 
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> 
    PREFIX schema: <http://schema.org/> 
    PREFIX sdo: <http://salt.semanticauthoring.org/ontologies/sdo#> 
    PREFIX swc: <http://data.semanticweb.org/ns/swc/ontology#> 
    PREFIX vcard: <http://www.w3.org/2006/vcard/ns#> 
    PREFIX xsd: <http://www.w3.org/2001/XMLSchema#> 
    
    """
    
    try:
        endpoint.setQuery(PREFIX+query)
        results = endpoint.query().convert()
        results = results['results']
        if len(results['bindings']) <= 0:
            print("Empty resultset")
            return None

        if not as_dataframe:
            if do_print:
                for binding in results['bindings']:    
                    print("; ".join([var+": "+ binding[var]['value']  for var in binding.keys()  ]))
            return results['bindings']

        else:
            pdata = json_normalize(results['bindings'])
            if do_print:
                display(pdata)
            return pdata
    except EndPointInternalError  as e :
        print("Could not complete request!")
        return None

print("ready!")
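
With `as_dataframe=True`, `run_query` passes the JSON bindings to `json_normalize`, which flattens each variable into `<variable>.type` and `<variable>.value` columns. The sketch below illustrates this on a hand-written binding; the binding content (including the value `42`) is hypothetical and not taken from the endpoint.

```python
import pandas as pd

# Hypothetical binding, shaped like the SPARQL JSON results returned by the endpoint
bindings = [
    {"numTriples": {"type": "typed-literal",
                    "datatype": "http://www.w3.org/2001/XMLSchema#integer",
                    "value": "42"}}
]

# Each nested key becomes a dotted column name, e.g. "numTriples.value"
df = pd.json_normalize(bindings)

# Values arrive as strings, so numeric columns need an explicit conversion
df["numTriples.value"] = pd.to_numeric(df["numTriples.value"])
print(df.columns.tolist())
```

Keeping the `.value` columns and dropping the rest is usually enough for the analyses in this notebook.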

## Available snapshots

```
https://covid-19ds.data.dice-research.org/2020-12-07
https://covid-19ds.data.dice-research.org/2021-03-11
https://covid-19ds.data.dice-research.org/2021-11-14
```

In [None]:
# Test query to make sure connection works

query = """

SELECT (COUNT(*) as ?numTriples)
FROM <https://covid-19ds.data.dice-research.org/2021-11-14>
WHERE { 
?s ?p ?o .
} 

"""

results = run_query(query, endpoint, as_dataframe=False, do_print=True)

## Citation Graph analysis by Journal

### Extract a labelled graph for publications and their citations

 1. Use a SPARQL query to extract simple connections across papers. Here we use predicates as edge labels and journal names (not node URIs/IDs) as node labels. The graph is built in `networkx`.

 2. **REFLECT:** Do you understand all steps of the process? Add comments to the main parts of the code.

In [None]:
## The following query retrieves, for every paper, the papers it references (with the journals of both); you can use it as inspiration
query = """

SELECT ?paper1 ?paper2 ?journal1 ?journal2   
FROM <https://covid-19ds.data.dice-research.org/2021-11-14>
WHERE {

?paper1 cvdo:hasBody ?body .
?paper1 bibtex:hasJournal ?journal1 .
?body cvdo:hasSection ?section .
?refCont nif:referenceContext ?section .
?refCont its:taIdentRef ?entry .
?entry a bibtex:Entry .
?entry owl:sameAs ?paper2 .
?paper2 a  <http://schema.org/ScholarlyArticle> .
?paper2 bibtex:hasJournal ?journal2 .

}
"""
results = run_query(query, endpoint, as_dataframe=False, do_print=False)


CitGraph = nx.DiGraph()

node_journals = {}
edges = set()

for binding in results:
    p1 = binding['paper1']['value'].replace('https://covid-19ds.data.dice-research.org/resource/', 'res:')
    j1 = binding['journal1']['value'].replace('https://covid-19ds.data.dice-research.org/resource/', 'res:')
    p2 = binding['paper2']['value'].replace('https://covid-19ds.data.dice-research.org/resource/', 'res:')
    j2 = binding['journal2']['value'].replace('https://covid-19ds.data.dice-research.org/resource/', 'res:')
    node_journals[p1]= j1
    node_journals[p2]= j2

    edges.add((p1, 'references', p2))

for n in node_journals.keys():
    # simplify node labels -- note this method is a bit naive
    node_journals[n] =  node_journals[n].encode('ascii', 'replace').decode().replace(' ','')[0:7]    

for n, t in node_journals.items():
    CitGraph.add_node(n, label=t)
    
for e in edges:            
    CitGraph.add_edge(e[0], e[2], label=e[1])
    
print(len(CitGraph))
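
The label simplification above (ASCII-only, spaces removed, first 7 characters) is naive because distinct journals can collapse onto the same label. A minimal sketch of the effect, using hypothetical journal names that are not necessarily present in the dataset:

```python
def simplify_label(name: str) -> str:
    # Same transformation as in the cell above: ASCII-only, no spaces, 7 characters
    return name.encode('ascii', 'replace').decode().replace(' ', '')[0:7]

# Hypothetical journal names, chosen only to illustrate collisions
examples = ["PLoS One", "Sci Rep", "Journal of Virology", "Journal of Medical Virology"]
simplified = {name: simplify_label(name) for name in examples}

for name, label in simplified.items():
    print(f"{name!r} -> {label!r}")
```

Both "Journal of ..." names map to the same label, so any frequency computed on simplified labels merges them. Keep this in mind when interpreting the most frequent labels.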

### Label frequency analysis

1. **TODO** How frequent is each node label? This part of the code is missing.

2. **REFLECT** How frequent is each pair of node labels? Compare the top 20 pairs with the two most frequent single node labels. Can you identify the web page of the actual journal?

3. **REFLECT** Add comments to the code and explain what common or uncommon behaviour can be derived from the final output.

In [None]:
journals = set( node_journals.values())
print(len(journals))

## Add here code to compute how frequent each single node label is
#
# ....
#
###

## Compute how frequent each (citing journal, cited journal) pair of labels is
pairs_frequency = {}

for e in CitGraph.edges():
    j_pair = (node_journals[e[0]], node_journals[e[1]])
    pairs_frequency[j_pair] = pairs_frequency.get(j_pair, 0) + 1

print(len(pairs_frequency))

pairs_frequency_list = sorted(pairs_frequency.items(), key=lambda x: x[1], reverse=True)
pairs_frequency_list[0:20]
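
Counting labels is a plain frequency count, so `collections.Counter` is one possible tool. The sketch below runs on a hypothetical toy mapping (`toy_node_journals`, `toy_edges`) rather than on `node_journals`, so adapting it to the real graph is left as part of the exercise.

```python
from collections import Counter

# Hypothetical toy data: node -> simplified journal label, plus citation edges
toy_node_journals = {"res:p1": "PLoSOne", "res:p2": "PLoSOne",
                     "res:p3": "SciRep", "res:p4": "Viruses"}
toy_edges = [("res:p1", "res:p2"), ("res:p3", "res:p1"), ("res:p4", "res:p2")]

# Frequency of each single node label
label_frequency = Counter(toy_node_journals.values())

# Frequency of each ordered (citing label, cited label) pair
pair_frequency = Counter(
    (toy_node_journals[u], toy_node_journals[v]) for u, v in toy_edges
)

print(label_frequency.most_common(2))
print(pair_frequency.most_common(3))
```

Comparing the two counters shows whether a frequent pair is frequent simply because both of its labels are frequent.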

### Subgraph Search

- **REFLECT: Toy example** See the code below and verify that the output is what you expect.

In [None]:
# Toy example

Gt = nx.DiGraph()
Gt.add_node('1', label='Viruses')
Gt.add_node('2', label='Viruses')
Gt.add_edge('1', '2', label='references')

Gt.add_node('3', label='Viruses')
Gt.add_node('4', label='Viruses')
Gt.add_edge('1', '3', label='references')
Gt.add_edge('3', '4', label='references')
Gt.add_edge('4', '1', label='extends')

Gt.add_node('5', label='PLoSOne')
Gt.add_edge('5', '2', label='extends')
Gt.add_edge('5', '3', label='extends')

In [None]:
node_label_dict = { k: "{}:{}".format(k,v) for k,v in nx.get_node_attributes(Gt,"label").items() }
edge_label_dict = nx.get_edge_attributes(Gt,"label")
        
pos = nx.shell_layout(Gt)
nx.draw_networkx_nodes(Gt, pos)
nx.draw_networkx_edges(Gt, pos)
nx.draw_networkx_labels(Gt, pos, node_label_dict )
_ = nx.draw_networkx_edge_labels(Gt, pos, edge_label_dict)


In [None]:
# We create a graph query and check for subgraph isomorphism

Q = nx.DiGraph()
Q.add_node(10, label='Viruses')
Q.add_node(20, label='Viruses')
Q.add_node(30, label='PLoSOne')
Q.add_edge(10, 20, label='references')
Q.add_edge(30, 10, label='extends')


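# DiGraphMatcher takes edge direction into account (GraphMatcher is meant for undirected graphs);
# node labels and edge labels must both match
iso = nx.algorithms.isomorphism
GM = iso.DiGraphMatcher(Gt, Q,
                        node_match=iso.categorical_node_match('label', None),
                        edge_match=iso.categorical_edge_match('label', None))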
for subgraph in GM.subgraph_isomorphisms_iter():
    print(subgraph)
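
When checking the output, note that `subgraph_isomorphisms_iter` looks for *induced* subgraphs: the matched nodes may not have any extra edges among them. `subgraph_monomorphisms_iter` drops that requirement. The sketch below contrasts the two on a small hypothetical graph (labels are ignored here to keep it short).

```python
import networkx as nx
from networkx.algorithms.isomorphism import DiGraphMatcher

# Hypothetical data graph: a transitive triple 1 -> 2 -> 3 with a shortcut 1 -> 3
G = nx.DiGraph([(1, 2), (2, 3), (1, 3)])

# Query: a directed path x -> y -> z
Q = nx.DiGraph([("x", "y"), ("y", "z")])

# Induced matching: fails, because nodes 1, 2, 3 also carry the extra edge 1 -> 3
induced = list(DiGraphMatcher(G, Q).subgraph_isomorphisms_iter())

# Non-induced matching: succeeds, extra edges among matched nodes are allowed
non_induced = list(DiGraphMatcher(G, Q).subgraph_monomorphisms_iter())

print(len(induced), len(non_induced))
print(non_induced)
```

Which of the two notions is appropriate depends on whether the pattern is meant to describe the complete citation structure among the matched papers or only a part of it.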


 **REFLECT: Explain the code below and its output.** What is the output telling us?

 Note: here we are querying the full citation graph `CitGraph` that we built.

In [None]:
Q = nx.DiGraph()
Q.add_node(1, label='PLoSOne')
Q.add_node(2, label='PLoSOne')
Q.add_node(3, label='PLoSOne')
Q.add_edge(1, 2, label='references')
Q.add_edge(3, 1, label='references')
Q.add_edge(3, 2, label='references')

max_iter = 10
# DiGraphMatcher takes edge direction into account (GraphMatcher is meant for undirected graphs)
GM = nx.algorithms.isomorphism.DiGraphMatcher(CitGraph, Q, node_match=nx.algorithms.isomorphism.categorical_node_match('label', None))
for subgraph in GM.subgraph_isomorphisms_iter():
    print(subgraph)
    max_iter-=1
    if max_iter ==0:
        print('stop')
        break

**TODO/REFLECT:** Check the slides for _all non-isomorphic, connected, directed graphlets of size 3_. Which of those do we not expect to find in our citation graph? Why? Check some of them.

Note: to do that, we need to drop the node-label constraint from the matching.
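
One way to match on structure only is to omit `node_match` when building the matcher. The sketch below searches for a directed 3-cycle in a hypothetical toy graph (not `CitGraph`); because a cycle has several automorphisms, the same three nodes are reported more than once, so matches are deduplicated by node set.

```python
import networkx as nx
from networkx.algorithms.isomorphism import DiGraphMatcher

# Hypothetical toy graph: one transitive triple (1, 2, 3) and one directed cycle (4, 5, 6)
G = nx.DiGraph([(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (6, 4)])

# Unlabelled query graphlet: a directed 3-cycle
Q = nx.DiGraph([("a", "b"), ("b", "c"), ("c", "a")])

# No node_match / edge_match: only the structure has to agree
matcher = DiGraphMatcher(G, Q)
mappings = list(matcher.subgraph_isomorphisms_iter())

# Each mapping goes from nodes of G to nodes of Q; rotations of the cycle
# yield the same set of G nodes, so keep each node set once
distinct_node_sets = {frozenset(m) for m in mappings}

print(len(mappings), distinct_node_sets)
```

In a citation graph, a cycle would require papers that cite each other in a loop, which is a useful case to reason about before running the search on `CitGraph`.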

Below are some more code examples.

**Let's focus on one connected component**

In [None]:
Gcc = sorted(nx.weakly_connected_components(CitGraph), key=len, reverse=True)
print(len(Gcc))
GX = CitGraph.subgraph(Gcc[1]) # 2nd connected component by size, simple to visualize
print(len(GX))
print(GX.edges)



node_label_dict = {n:node_journals[n] for n in GX.nodes }
#edge_label_dict = nx.get_edge_attributes(GX,"label")

pos = nx.shell_layout(GX)
nx.draw_networkx_nodes(GX, pos)
nx.draw_networkx_edges(GX, pos)
_ = nx.draw_networkx_labels(GX, pos, node_label_dict )
#_ = nx.draw_networkx_edge_labels(GX, pos, edge_label_dict)


In [None]:
# We build a query and search in the entire graph (not only the conn. comp.)
Q = nx.DiGraph()
Q.add_node(1, label='SciRep')
Q.add_node(2, label='SciRep')
Q.add_node(3, label='PLoSOne')
Q.add_edge(1, 2, label='references')
Q.add_edge(1, 3, label='references')


max_iter = 10
# DiGraphMatcher takes edge direction into account (GraphMatcher is meant for undirected graphs)
GM = nx.algorithms.isomorphism.DiGraphMatcher(CitGraph, Q, node_match=nx.algorithms.isomorphism.categorical_node_match('label', None))
for subgraph in GM.subgraph_isomorphisms_iter():
    print(subgraph)
    max_iter-=1
    if max_iter ==0:
        print('stop')
        break

### Compute Pattern Frequency

- **TODO:** Pick one 4-node graphlet from the slides, **can you find a frequent pattern** of that shape? Use a valid notion of frequency

- **TODO:** Build a new graph: co-authorship graph, where two authors are connected if they wrote a paper together
- **TODO:** Add to the co-authorship graph additional edges (with a different label) between two authors if they come from the same institution
- **TODO:** Identify the largest connected component and count the triangles found in it
- **TODO/REFLECT:** Apply hierarchical clustering using the networkx `girvan_newman` function. What is the largest community found?
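
For the 4-node graphlet task, simply counting the mappings returned by the matcher overstates how often a pattern occurs. One commonly used alternative is minimum image-based (MNI) support: for every pattern node, count the distinct graph nodes it is mapped to, then take the minimum. The sketch below compares three counts on a hypothetical star-shaped toy graph; it is one possible notion of frequency, not necessarily the one used in the slides.

```python
from collections import defaultdict

import networkx as nx
from networkx.algorithms.isomorphism import DiGraphMatcher

# Hypothetical toy graph: paper "a" references four other papers
G = nx.DiGraph([("a", "b"), ("a", "c"), ("a", "d"), ("a", "e")])

# 4-node pattern: one node with three outgoing edges (an out-star)
P = nx.DiGraph([(1, 2), (1, 3), (1, 4)])

images = defaultdict(set)   # pattern node -> distinct graph nodes mapped to it
node_sets = set()           # distinct sets of graph nodes covered by a match
n_embeddings = 0            # raw number of mappings

for mapping in DiGraphMatcher(G, P).subgraph_isomorphisms_iter():
    n_embeddings += 1
    node_sets.add(frozenset(mapping))
    for g_node, p_node in mapping.items():
        images[p_node].add(g_node)

# MNI support: the least-covered pattern node bounds the frequency
mni_support = min(len(images[p]) for p in P.nodes) if images else 0

print(n_embeddings, len(node_sets), mni_support)
```

The raw mapping count grows with the symmetries of the pattern, while MNI support stays at 1 here because every match reuses the same centre node.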

In [None]:
## The following query retrieves all the authors for each paper that has a body and a journal
query = """

SELECT DISTINCT ?paper1 ?author1  ?journal1 
FROM <https://covid-19ds.data.dice-research.org/2020-12-07>
WHERE {

?paper1 cvdo:hasBody ?body .
?paper1  bibtex:hasAuthor ?author1 .
?paper1 bibtex:hasJournal ?journal1 .

}
"""
results_authors = run_query(query, endpoint, as_dataframe=False, do_print=False)
len(results_authors)
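
The author query returns one row per (paper, author) pair, which is enough to build a co-authorship graph. The sketch below uses a hypothetical list of pairs in place of `results_authors`, and covers the graph construction, the largest connected component, the triangle count, and the first `girvan_newman` split. Institution edges are left out.

```python
from itertools import combinations

import networkx as nx
from networkx.algorithms.community import girvan_newman

# Hypothetical (paper, author) pairs standing in for the query results
pairs = [("p1", "a"), ("p1", "b"), ("p1", "c"),
         ("p2", "c"), ("p2", "d"),
         ("p3", "e"), ("p3", "f")]

# Group authors by paper
authors_by_paper = {}
for paper, author in pairs:
    authors_by_paper.setdefault(paper, set()).add(author)

# Connect every pair of authors that share a paper
CoAuth = nx.Graph()
for authors in authors_by_paper.values():
    CoAuth.add_nodes_from(authors)
    for u, v in combinations(sorted(authors), 2):
        CoAuth.add_edge(u, v, label="coauthor")

# Largest connected component and its triangle count
largest_cc = max(nx.connected_components(CoAuth), key=len)
LCC = CoAuth.subgraph(largest_cc)
# nx.triangles counts triangles per node, so each triangle is counted 3 times
n_triangles = sum(nx.triangles(LCC).values()) // 3

# First level of the Girvan-Newman hierarchy
first_split = next(girvan_newman(LCC))
largest_community = max(first_split, key=len)

print(sorted(largest_cc), n_triangles, sorted(largest_community))
```

`girvan_newman` recomputes edge betweenness after every removal, so on the real data it is likely to be practical only on a connected component of moderate size.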

### Graph Sampling

- **TODO Use SPARQL to query** the original graph and **produce a subsampled** version of the snapshot at `2020-12-07`
- **TODO Estimate** two measures of your choice
- **TODO Apply the same sampling** to the snapshot at `2021-11-14`
- **TODO Verify** whether the estimate still holds
- **REFLECT:** Can you compute the true number? What are your conclusions?
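
The task asks for sampling inside SPARQL. The sketch below only illustrates the estimate-and-scale logic locally, on a synthetic random graph that stands in for a snapshot. The graph, the sampling rate `p`, and the seeds are all assumptions made for the illustration.

```python
import random

import networkx as nx

# Synthetic stand-in for a snapshot (assumption: not the real citation data)
G = nx.gnm_random_graph(2000, 10000, seed=1, directed=True)

p = 0.1                   # sampling probability per edge
rng = random.Random(42)   # fixed seed so the sample is reproducible

# Uniform edge sampling: keep each edge independently with probability p
sampled_edges = [e for e in G.edges() if rng.random() < p]

# Scale the sample back up to estimate measures of the full graph
est_edges = len(sampled_edges) / p
est_mean_out_degree = est_edges / G.number_of_nodes()

# On the toy graph the true value is available, so the error can be checked
true_edges = G.number_of_edges()
rel_error = abs(est_edges - true_edges) / true_edges

print(est_edges, true_edges, round(rel_error, 3))
```

Repeating the same procedure on a second snapshot shows whether an estimate obtained from one sample remains a reasonable guide as the graph grows.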