# Day1: Graph Analysis of the Citation Graph of COVID Scientific Literature

> Pestryakova, S., Vollmers, D., Sherif, M.A. et al. 
> CovidPubGraph: A FAIR Knowledge Graph of COVID-19 Publications. Sci Data 9, 389 (2022). 
> https://doi.org/10.1038/s41597-022-01298-2

## BY: _______________________

## Scope and purpose: Exploratory Data Analysis

This notebook presents an initial analysis of some aspects of the data contained in the citation graph cited above.

The dataset is distributed in RDF format, as multiple snapshots taken at different points in time.

**Some snapshots** of the dataset have been loaded into a SPARQL endpoint at `http://192.225.39.123:8888/sparql`

Each snapshot is analyzed by extracting data with SPARQL queries and then analyzing portions of it with libraries and tools such as `networkx`.

## Imports and Utility Functions

In [None]:
# IF PACKAGES NEED TO BE INSTALLED
# import sys

# # Install a conda package in the current Jupyter kernel
# !conda install --yes --prefix {sys.prefix} networkx SPARQLWrapper

# print("Done!")

In [None]:
import os
import glob
import string
import numpy as np
import pandas as pd
from pandas import json_normalize

import networkx as nx
import matplotlib.pyplot as plt
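from IPython.display import display  # explicit import, so display() also works outside a notebook cell
from SPARQLWrapper import SPARQLWrapper, JSON
from SPARQLWrapper.SPARQLExceptions import EndPointInternalError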


pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

endpoint = SPARQLWrapper('http://191.225.39.123:8888/sparql')
endpoint.setReturnFormat(JSON)
endpoint.setTimeout(1200)
endpoint.method = 'POST'


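def run_query(query, endpoint, as_dataframe=False, do_print=False):
    """Run a SPARQL query on `endpoint`, prepending the common prefixes.

    Returns the list of result bindings, or a flattened pandas DataFrame
    when `as_dataframe` is True. Returns None when the result set is empty
    or the endpoint reports an internal error.
    """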
    
    PREFIX= """
    PREFIX bibo: <http://purl.org/ontology/bibo/> 
    PREFIX bibtex: <http://purl.org/net/nknouf/ns/bibtex#> 
    PREFIX cvdo: <https://covid-19ds.data.dice-research.org/ontology/> 
    PREFIX cvdr: <https://covid-19ds.data.dice-research.org/resource/> 
    PREFIX dbo: <https://dbpedia.org/ontology/> 
    PREFIX dcterms: <http://purl.org/dc/terms/> 
    PREFIX fabio: <http://purl.org/spar/fabio/> 
    PREFIX foaf: <http://xmlns.com/foaf/0.1/> 
    PREFIX its: <http://www.w3.org/2005/11/its/rdf#> 
    PREFIX nif: <http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#> 
    PREFIX owl: <http://www.w3.org/2002/07/owl#> 
    PREFIX prov: <http://www.w3.org/ns/prov#> 
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> 
    PREFIX schema: <http://schema.org/> 
    PREFIX sdo: <http://salt.semanticauthoring.org/ontologies/sdo#> 
    PREFIX swc: <http://data.semanticweb.org/ns/swc/ontology#> 
    PREFIX vcard: <http://www.w3.org/2006/vcard/ns#> 
    PREFIX xsd: <http://www.w3.org/2001/XMLSchema#> 
    
    """
    
    try:
        endpoint.setQuery(PREFIX+query)
        results = endpoint.query().convert()
        results = results['results']
        if len(results['bindings']) <= 0:
            print("Empty resultset")
            return None

        if not as_dataframe:
            if do_print:
                for binding in results['bindings']:    
                    print("; ".join([var+": "+ binding[var]['value']  for var in binding.keys()  ]))
            return results['bindings']

        else:
            pdata = json_normalize(results['bindings'])
            if do_print:
                display(pdata)
            return pdata
    except EndPointInternalError as e:
        # The endpoint reported an internal error: print it and give up
        print(f"Could not complete request! {e}")
        return None

print("ready!")

## Available snapshots

```
https://covid-19ds.data.dice-research.org/2020-09-23
https://covid-19ds.data.dice-research.org/2020-11-14
https://covid-19ds.data.dice-research.org/2020-12-07
https://covid-19ds.data.dice-research.org/2021-03-11
https://covid-19ds.data.dice-research.org/2021-11-14
```

In [None]:
query = """

SELECT  (COUNT(*) as ?numTriples)
FROM <https://covid-19ds.data.dice-research.org/2020-09-23>
WHERE {   
?s ?p ?o
}

"""

results = run_query(query, endpoint, as_dataframe=True, do_print=False)
display(results)

In [None]:
query = """

SELECT  (COUNT(*) as ?numTriples)
FROM <https://covid-19ds.data.dice-research.org/2021-11-14>
WHERE {   
?s ?p ?o
}

"""

results = run_query(query, endpoint, as_dataframe=True, do_print=False)
display(results)
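
The two cells above differ only in the snapshot URI. As a minimal sketch, the same count query can be generated for every URI listed under "Available snapshots". The helper name `count_triples_query` is our own and not part of the notebook; since only some snapshots are loaded into the endpoint, some of these queries may come back empty.

```python
# Snapshot graph URIs copied from the "Available snapshots" list.
SNAPSHOTS = [
    "https://covid-19ds.data.dice-research.org/2020-09-23",
    "https://covid-19ds.data.dice-research.org/2020-11-14",
    "https://covid-19ds.data.dice-research.org/2020-12-07",
    "https://covid-19ds.data.dice-research.org/2021-03-11",
    "https://covid-19ds.data.dice-research.org/2021-11-14",
]


def count_triples_query(graph_uri):
    """Build a SPARQL query that counts all triples in one named graph.

    Hypothetical helper, introduced here only for illustration.
    """
    return (
        "SELECT (COUNT(*) AS ?numTriples)\n"
        f"FROM <{graph_uri}>\n"
        "WHERE { ?s ?p ?o }"
    )


# One query string per snapshot, keyed by the snapshot URI.
queries = {uri: count_triples_query(uri) for uri in SNAPSHOTS}
print(queries[SNAPSHOTS[0]])
```

Each string in `queries` could then be passed to `run_query(..., endpoint, as_dataframe=True)` to compare how the graph grows between snapshots.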

## Exploratory Analysis

### Summarize node types and their prevalence in the graph

In [None]:
query = """
SELECT ?c (COUNT(?s) AS ?numInstances)
FROM <https://covid-19ds.data.dice-research.org/2020-09-23>
WHERE { 
?s a ?c .
} 

GROUP BY ?c
"""

results = run_query(query, endpoint, as_dataframe=True, do_print=True)

### Summarize edge types and their prevalence in the graph

In [None]:
query = """

SELECT ?p (COUNT(?p) AS ?numTriples)
FROM <https://covid-19ds.data.dice-research.org/2020-09-23>
WHERE { 
?s ?p ?o .
} 

GROUP BY ?p
"""

results = run_query(query, endpoint, as_dataframe=True, do_print=True)

### Additional analysis ...?

1. **TODO: Run one additional query** that helps you understand the size and scope of the dataset

2. **REFLECT: Is there anything interesting/surprising/unexpected** from the results of the queries above?

### Find the errors/mismatches in the ER schema presented in the original paper

1. **TODO: Select a subset of Fig. 1** in the original paper. **Identify any mismatch** between the schema presented there and the content of the dataset
2. **TODO: Produce an updated schema for the subset you studied**
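
One possible way to organize this comparison, sketched under assumptions: collect the predicates actually used by instances of a class, then diff them against the predicates drawn in the figure. Both helper names and the `ex:` predicates below are hypothetical placeholders, not taken from Fig. 1 or from the endpoint.

```python
def predicates_of_class_query(graph_uri, class_uri):
    """Build a query listing every predicate used by instances of a class.

    Hypothetical helper, introduced here only for illustration.
    """
    return (
        "SELECT DISTINCT ?p\n"
        f"FROM <{graph_uri}>\n"
        f"WHERE {{ ?s a <{class_uri}> . ?s ?p ?o . }}"
    )


def schema_diff(expected, observed):
    """Compare predicates drawn in the schema with predicates found in the data."""
    expected, observed = set(expected), set(observed)
    return {
        "missing_in_data": sorted(expected - observed),
        "missing_in_schema": sorted(observed - expected),
        "confirmed": sorted(expected & observed),
    }


# Toy input, made up for the example: replace it with what you read in the
# figure and with what the endpoint returns.
expected = {"ex:hasTitle", "ex:hasAuthor", "ex:hasVenue"}
observed = {"ex:hasTitle", "ex:hasAuthor", "ex:hasAbstract"}
diff = schema_diff(expected, observed)
print(diff)
```

The `missing_in_data` and `missing_in_schema` lists are the candidates to discuss when producing the updated schema.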

In [None]:
### 

### Find the top 20 authors with the largest number of papers in the collection

In [None]:
query = """

SELECT ?s (COUNT(?o) AS ?numPapers)
FROM <https://covid-19ds.data.dice-research.org/2020-09-23>
WHERE { 
?s a <http://ma-graph.org/class/Author> .
?o <http://purl.org/net/nknouf/ns/bibtex#hasAuthor> ?s .
} 
GROUP BY ?s
ORDER BY DESC(COUNT(?o))
LIMIT 20
"""

results = run_query(query, endpoint, as_dataframe=True, do_print=True)

### Find the top 20 papers with the largest number of referenced papers in the collection

1. **TODO: check** the initial query below
2. **TODO: update the query** so that it retrieves the required papers
3. **REFLECT: are you able to locate** at least one of the returned papers in an online repository/library? Does the data check out?

In [None]:
## The following query retrieves all the references for a single paper; you can use it as inspiration
query = """

SELECT DISTINCT ?paper1 ?paper2   
FROM <https://covid-19ds.data.dice-research.org/2020-09-23>
WHERE {

VALUES ?paper1 { <https://covid-19ds.data.dice-research.org/resource/pmc1065257> }

?paper1 cvdo:hasBody ?body .
?body cvdo:hasSection ?section .
?refCont nif:referenceContext ?section .
?refCont its:taIdentRef ?entry .
?entry a bibtex:Entry .
?entry owl:sameAs ?paper2 .
?paper2 a  <http://schema.org/ScholarlyArticle> .

}
"""
results = run_query(query, endpoint, as_dataframe=True, do_print=True)
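
As a cross-check for a server-side aggregate, the ranking can also be computed on the client from `?paper1`/`?paper2` bindings. This is a minimal sketch, assuming the bindings were fetched with `as_dataframe=False` (a list of dicts in the SPARQL JSON results format). The function name `top_citing_papers` and the toy bindings are hypothetical. It only ranks the rows that were actually retrieved, so a truncated result gives a truncated ranking.

```python
from collections import Counter


def top_citing_papers(bindings, n=20):
    """Rank ?paper1 values by the number of distinct ?paper2 they reference."""
    # Deduplicate (citing, cited) pairs so repeated rows are counted once.
    pairs = {(b["paper1"]["value"], b["paper2"]["value"]) for b in bindings}
    counts = Counter(citing for citing, _ in pairs)
    return counts.most_common(n)


# Toy bindings, made up for the example.
toy_bindings = [
    {"paper1": {"type": "uri", "value": "urn:A"}, "paper2": {"type": "uri", "value": "urn:B"}},
    {"paper1": {"type": "uri", "value": "urn:A"}, "paper2": {"type": "uri", "value": "urn:C"}},
    {"paper1": {"type": "uri", "value": "urn:A"}, "paper2": {"type": "uri", "value": "urn:B"}},
    {"paper1": {"type": "uri", "value": "urn:D"}, "paper2": {"type": "uri", "value": "urn:B"}},
]
ranking = top_citing_papers(toy_bindings, n=20)
print(ranking)
```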


### The Citation Network: loading Paper-cites->Paper subnetwork

**In the following:**

- We extract the `paper-[cites]->paper` edges
- We build a networkx Directed Graph
- **TODO:** check out the tutorial of networkx here: [networkx.org/.../tutorial.html](https://networkx.org/documentation/stable/tutorial.html)
- We extract the connected components and find the size of the largest one
- **TODO:** The full citation graph is too big to be returned... we will need a smart solution
- **REFLECT:** Decide how to edit the query or the code to obtain a reliable (even if imprecise/incomplete) result
- **TODO:** Analyze the graph obtained, especially focusing on the largest connected component (see below)
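
One possible approach to the "too big to be returned" problem is to fetch the edges page by page with `LIMIT`/`OFFSET`. The sketch below only builds the query strings. The helper name `paged_queries` is our own, and it assumes the base query does not already end with a `LIMIT` clause. Without an `ORDER BY`, SPARQL does not guarantee a stable row order, so pages may overlap or skip rows; treat the result as a sample rather than the exact edge set.

```python
def paged_queries(base_query, page_size, max_pages):
    """Yield copies of base_query with LIMIT/OFFSET set for successive pages.

    Hypothetical helper; base_query must not already contain a LIMIT clause.
    """
    for page in range(max_pages):
        offset = page * page_size
        yield f"{base_query.rstrip()}\nLIMIT {page_size}\nOFFSET {offset}"


base = "SELECT DISTINCT ?paper1 ?paper2 WHERE { ?paper1 ?cites ?paper2 }"
pages = list(paged_queries(base, page_size=20000, max_pages=3))
print(pages[1])
```

In the notebook, each page could be passed to `run_query(page, endpoint)` and the bindings appended to one list, stopping as soon as a page comes back as `None` (empty result set).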







In [None]:
query = """

SELECT DISTINCT ?paper1 ?paper2   
FROM <https://covid-19ds.data.dice-research.org/2020-09-23>

WHERE {

?paper1 cvdo:hasBody ?body .
?body cvdo:hasSection ?section .
?refCont nif:referenceContext ?section .
?refCont its:taIdentRef ?entry .
?entry a bibtex:Entry .
?entry owl:sameAs ?paper2 .
?paper2 a  <http://schema.org/ScholarlyArticle> .

}
LIMIT 20000

"""

results = run_query(query, endpoint, as_dataframe=False, do_print=False)

len(results)

In [None]:
G = nx.DiGraph()
for binding in results:
    p1 = binding['paper1']['value']
    p2 = binding['paper2']['value']
    G.add_edge(p1, p2, label='cites')
    
print(nx.number_weakly_connected_components(G))

largest_cc = max(nx.weakly_connected_components(G), key=len)
len(largest_cc)
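
To make explicit what "weakly connected" means here: edge direction is ignored, and two papers belong to the same component if a chain of citations (in either direction) links them. The plain-Python sketch below illustrates the idea on toy edges; it is not how networkx implements it, and in the notebook `nx.weakly_connected_components` should be preferred.

```python
from collections import defaultdict, deque


def weakly_connected_components(edges):
    """Group nodes into components, treating every directed edge as undirected."""
    neighbors = defaultdict(set)
    for src, dst in edges:
        neighbors[src].add(dst)
        neighbors[dst].add(src)

    seen = set()
    components = []
    for start in neighbors:
        if start in seen:
            continue
        # Breadth-first search from an unvisited node collects one component.
        seen.add(start)
        component = {start}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in neighbors[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    component.add(nxt)
                    queue.append(nxt)
        components.append(component)
    return components


# Toy citation edges, made up for the example: A and C both cite B, D cites E.
toy_edges = [("A", "B"), ("C", "B"), ("D", "E")]
components = weakly_connected_components(toy_edges)
largest = max(components, key=len)
print(len(components), sorted(largest))
```

Note that A and C end up in the same component even though neither cites the other: they are linked through B.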

In [None]:
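# NOTE: some metrics (e.g. the diameter) are only defined on a connected graph,
# so we focus on the largest weakly connected component
largest_cc = max(nx.weakly_connected_components(G), key=len)

H = G.subgraph(largest_cc)
print(nx.density(H))
print(nx.diameter(H.to_undirected()))
# Median node degree (in + out); degree_histogram returns node counts per degree value instead
np.median([degree for _, degree in H.degree()])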

#### Node Centrality:

- **TODO: Pick two nodes with degree > 2** in the same component
- **TODO/REFLECT: Compare** their centrality with their number of citations and authors
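
As one possible starting point, in-degree centrality relates directly to the number of citations: it is the number of incoming `cites` edges divided by the number of other nodes. The plain-Python sketch below shows the computation on toy edges (made up for the example). In the notebook, `nx.in_degree_centrality(H)` should give the same values for a graph without self-loops, and other measures such as betweenness or PageRank may rank the nodes differently.

```python
def in_degree_centrality(edges):
    """Return in-degree / (n - 1) for every node of a directed edge list."""
    nodes = set()
    in_degree = {}
    for src, dst in set(edges):  # set() drops duplicate edges
        nodes.update((src, dst))
        in_degree[dst] = in_degree.get(dst, 0) + 1
    if len(nodes) <= 1:
        return {node: 0.0 for node in nodes}
    scale = 1.0 / (len(nodes) - 1)
    return {node: in_degree.get(node, 0) * scale for node in nodes}


# Toy citation edges: A and B cite C, C cites D.
toy_edges = [("A", "C"), ("B", "C"), ("C", "D")]
centrality = in_degree_centrality(toy_edges)
for node in sorted(centrality):
    print(node, round(centrality[node], 3))
```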

## Author H-Index

- **TODO: Take the largest** connected component (LCC)
- **TODO: Find the author of a paper** in that LCC
- **TODO: Compute their H-index** based on citations in the graph
- (**Optional**: Can you compare the author's PageRank with their H-index and their centrality?)
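
For reference, an author has H-index *h* if *h* of their papers have at least *h* citations each. A minimal sketch of the computation follows, assuming you have already collected one citation count per paper of the author (for example the in-degree of each paper in `G`). Because the graph is only a sample of the full citation network, the value obtained this way is a lower bound on the author's real H-index.

```python
def h_index(citation_counts):
    """Return the largest h such that h papers have at least h citations each."""
    ranked = sorted(citation_counts, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h


# Toy citation counts for the papers of one author, made up for the example.
print(h_index([10, 8, 5, 4, 3]))
```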