# Access to the CORD19-NEKG endpoint and data sources
### maintained by the WIMMICS team

This notebook is intended to be used as a client to the dataset called CORD-19 Named Entities Knowledge Graph (CORD19-NEKG): it describes the named entities identified in the 47,000+ articles provided by the COVID-19 Open Research Dataset (CORD-19).

For now, the named entities published are those identified by DBpedia Spotlight (DBpedia URIs) and entity-fishing (Wikidata URIs) in articles' titles and abstracts.

You can query the dataset from our Virtuoso endpoint: https://covid19.i3s.unice.fr/sparql. Here are the relevant named graphs:
    http://ns.inria.fr/covid19/graph/metadata: dataset description + definition of a few properties
    http://ns.inria.fr/covid19/graph/articles: articles metadata (title, authors, DOIs, journal etc.)
    http://ns.inria.fr/covid19/graph/dbpedia-spotlight: named entities identified by DBpedia Spotlight => 1,835,902 named entities
    http://ns.inria.fr/covid19/graph/entityfishing: named entities identified by Entity-fishing => 790,922 named entities

The dataset is described here: https://github.com/Wimmics/cord19-nekg
More specifically, you will find details on how named entities are represented in RDF on this page: https://github.com/Wimmics/cord19-nekg/blob/master/doc/01-data-modeling.md
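
As a minimal sketch of how one of the named graphs listed above can be targeted, the helper below builds a SPARQL query that counts the triples stored in a single graph. The function name `count_triples_query` and the constant `GRAPH_BASE` are hypothetical and are not part of the CORD19-NEKG tooling; counting every triple of a large graph may also be slow on a public endpoint.

```python
# Hypothetical helper, introduced here for illustration only
GRAPH_BASE = "http://ns.inria.fr/covid19/graph/"

def count_triples_query(graph_name):
    """Build a SPARQL query that counts the triples stored in one named graph."""
    return (
        "SELECT (COUNT(*) AS ?triples)\n"
        "WHERE { GRAPH <" + GRAPH_BASE + graph_name + "> { ?s ?p ?o } }"
    )

# Print the query text; it can then be sent to the endpoint like any other query
print(count_triples_query("entityfishing"))
```

The resulting string can be passed to any SPARQL client pointed at https://covid19.i3s.unice.fr/sparql.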


In [None]:
from __future__ import print_function

## Install required packages

#### SPARQLWrapper 

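This package is a Python wrapper around SPARQL services. It is used here to send queries to the endpoint and retrieve the results, which are then converted to a Pandas DataFrame. https://rdflib.dev/sparqlwrapper/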

#### Pandas

A Pandas DataFrame is used to hold the query results.


NOTE: if you are running the Anaconda distribution, the preferred way to install the packages is:

_conda install -c conda-forge sparqlwrapper_

_conda install pandas_

Run the installation only once, or periodically to check for updates.
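
To avoid reinstalling on every run, one option is to check first whether the packages can already be imported. The sketch below uses only the standard library; `missing_packages` is a hypothetical helper name, not something provided by pip or conda.

```python
import importlib.util

def missing_packages(names):
    """Return the module names that cannot be imported in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# An empty list means both packages are already available
print(missing_packages(["pandas", "SPARQLWrapper"]))
```

If the printed list is not empty, install the listed packages with pip or conda as described above.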

In [None]:
!pip install pandas

In [None]:
!pip install SPARQLWrapper

In [303]:
import pandas as pd
print('Pandas ver.', pd.__version__)

import SPARQLWrapper
import json
print('SPARQLWrapper ver.', SPARQLWrapper.__version__)

from SPARQLWrapper import SPARQLWrapper, JSON

Pandas ver. 1.0.2
SPARQLWrapper ver. 1.8.2


In [304]:
def sparql_service_to_dataframe(service, query):
    """
    Helper function to convert SPARQL results into a Pandas DataFrame.
    
    Credit to Ted Lawless https://lawlesst.github.io/notebook/sparql-dataframe.html
    """
    sparql = SPARQLWrapper(service)
    sparql.setQuery(query)
    sparql.setReturnFormat(JSON)
    result = sparql.query()

    processed_results = json.load(result.response)
    cols = processed_results['head']['vars']

    out = []
    for row in processed_results['results']['bindings']:
        item = []
        for c in cols:
            item.append(row.get(c, {}).get('value'))
        out.append(item)

    return pd.DataFrame(out, columns=cols)
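
To see what the helper does with the service output, the sketch below applies the same flattening logic to a small hand-written document in the SPARQL 1.1 Query Results JSON format. The sample values and the name `bindings_to_rows` are illustrative assumptions; no endpoint is contacted and Pandas is left out so the example stays dependency-free.

```python
import json

# Hand-written sample in the SPARQL JSON results layout (illustrative values only)
sample = json.loads('''
{
  "head": {"vars": ["title", "year"]},
  "results": {"bindings": [
    {"title": {"type": "literal", "value": "Article A"},
     "year":  {"type": "literal", "value": "2020"}},
    {"title": {"type": "literal", "value": "Article B"}}
  ]}
}
''')

def bindings_to_rows(results):
    """Flatten SPARQL JSON bindings into column names and a list of rows."""
    cols = results["head"]["vars"]
    # A variable left unbound in a row (e.g. from an OPTIONAL clause) becomes None
    rows = [[b.get(c, {}).get("value") for c in cols]
            for b in results["results"]["bindings"]]
    return cols, rows

cols, rows = bindings_to_rows(sample)
print(cols)
print(rows)
```

Note that every value arrives as a string, so numeric columns such as a year need an explicit conversion after the DataFrame is built.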


# Run queries

In [331]:
wds_Corese_Covid = 'https://covid19.i3s.unice.fr/sparql'

In [385]:
# Select articles whose abstract contains the phrase "asymptomatic infection"
query = '''
SELECT (group_concat(distinct ?name,", ") AS ?authors)
       ?title 
       (year(?date) as ?year)
       ?pub
       ?url
       #(sample(?) as ?annotation)
WHERE {
    graph <http://ns.inria.fr/covid19/graph/articles>
    {
        ?doc a ?t;
            dce:creator ?name;
            dct:title ?title;
            schema:publication ?pub;
            schema:url ?url;
            dct:abstract [ rdf:value ?abs ].

        optional { ?doc dct:issued ?date }
        filter contains(?abs, "asymptomatic infection")
    }
    
    #graph <http://ns.inria.fr/covid19/graph/entityfishing> {
    #?a1 a oa:Annotation;
    #    schema:about ?doc;
    #    dct:subject ?an;
    #    oa:hasBody ?uri.
  #}
} 
group by ?doc ?title ?date ?pub ?url
order by desc(?date)
limit 10
	
'''
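
The search phrase in the query above is hard-coded in the `filter` clause. A minimal sketch for building that clause from an arbitrary phrase is shown below; `abstract_filter` is a hypothetical helper, and it only escapes backslashes and double quotes, which is an assumption about what is sufficient for simple phrases.

```python
def abstract_filter(phrase):
    """Return a SPARQL FILTER clause matching abstracts that contain `phrase`."""
    # Escape backslashes first, then double quotes, so the phrase stays a valid string literal
    escaped = phrase.replace("\\", "\\\\").replace('"', '\\"')
    return 'filter contains(?abs, "%s")' % escaped

print(abstract_filter("asymptomatic infection"))
```

The returned clause can be substituted for the fixed `filter` line when the query string is assembled.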

In [386]:
%time df = sparql_service_to_dataframe(wds_Corese_Covid, query)
print(df.shape)

Wall time: 1.76 s
(10, 5)


In [391]:
pd.set_option('display.max_colwidth', None)
pd.set_option('display.precision', 3)
pd.set_option('display.max_rows', 9999)

df.head()

Unnamed: 0,authors,title,year,pub,url
0,"Chen, Shey-Ying, Hsueh, Po-Ren, King, Chwan-Chuen, Schwartz, Jonathan, Yang, Guang-Yang, Yen, Muh-Yong",Interrupting COVID-19 transmission by implementing enhanced traffic control bundling: Implications for global prevention and control efforts,2020,"Journal of Microbiology, Immunology and Infection",https://doi.org/10.1016/j.jmii.2020.03.011
1,"Chang, Le, Wang, Lunan, Yan, Ying",Coronavirus Disease 2019: Coronaviruses and Blood Safety,2020,Transfusion Medicine Reviews,https://doi.org/10.1016/j.tmrv.2020.02.003
2,"Ho, Yi-Jung, Lai, Zheng-Zong, Lu, Jeng-Wei",Cephalotaxine inhibits Zika infection by impeding viral replication and stability,2020,Biochemical and Biophysical Research Communications,https://doi.org/10.1016/j.bbrc.2019.12.012
3,"Alenazi, Thamer",Severe Middle East Respiratory Syndrome (MERS) Pneumonia,2019,Reference Module in Biomedical Sciences,https://doi.org/10.1016/b978-0-12-801238-3.11488-6
4,"Dong, Chao, Long, Ting, Pan, Yaohui, Shi, Tianyu, Yin, Qiuju, Zhang, Wensi",Effects of asymptomatic infection on the dynamical interplay between behavior and disease transmission in multiplex networks,2019,Physica A: Statistical Mechanics and its Applications,https://doi.org/10.1016/j.physa.2019.04.266


In [440]:
query_corona_vs_cancer = '''
prefix oa:     <http://www.w3.org/ns/oa#>
prefix schema: <http://schema.org/>
prefix rdfs:   <http://www.w3.org/2000/01/rdf-schema#>

prefix wd:     <http://www.wikidata.org/entity/>
prefix wdt:    <http://www.wikidata.org/prop/direct/>

# wdt:P279 = subclass of 
# wdt:P31 = instance of
# wd:Q12078 = cancer
# wd:Q1134583 = coronavirus family = Coronaviridae

select distinct ?article ?dis1 ?dis1Label ?dis2 ?dis2Label #?dis2Subject

from <http://ns.inria.fr/covid19/graph/entityfishing>
from named <http://ns.inria.fr/covid19/graph/wikidata-named-entities>

where {
    # Look for 2 annotations of the same article with Wikidata URIs ?dis1 and ?dis2 

    ?annot1 schema:about ?article; oa:hasBody ?dis1.
    ?annot2 schema:about ?article; oa:hasBody ?dis2.
 
    graph <http://ns.inria.fr/covid19/graph/wikidata-named-entities>
    {
      ?entity1 rdfs:label "cancer"@en. # ?entity1 is wd:Q12078
      
      { ?dis1 rdfs:label ?dis1Label.
        filter (?dis1 = ?entity1) } # ?dis1 is "cancer"

      UNION

      { ?dis1 wdt:P279 ?entity1;
              rdfs:label ?dis1Label. }  # ?dis1 is a subclass of "cancer" (at any depth)

      UNION

      { ?dis1 wdt:P31 ?entity1; 
              rdfs:label ?dis1Label. }  # ?dis1 is an instance of "cancer" or a subclass thereof



      ?entity2 rdfs:label "Coronaviridae"@en. # ?entity2 is wd:Q1134583

      { ?dis2 rdfs:label ?dis2Label. 
      filter (?dis2 = ?entity2) }

      UNION

      { ?dis2 wdt:P279 ?entity2;
              rdfs:label ?dis2Label. } # ?dis2 is a subclass of "Coronaviridae" (at any depth)

      UNION

      { ?dis2 wdt:P31 ?entity2; 
              rdfs:label ?dis2Label. }  # ?dis2 is an instance of "Coronaviridae" or a subclass thereof

    }

    
}
order by ?dis1 ?dis2
limit 1000
'''

In [441]:
%time df = sparql_service_to_dataframe(wds_Corese_Covid, query_corona_vs_cancer)
print(df.shape)

Wall time: 30.2 s
(117, 5)


In [442]:
df#.loc[:, ['dis1Label', 'dis2Label']]

Unnamed: 0,article,dis1,dis1Label,dis2,dis2Label
0,http://ns.inria.fr/covid19/3dd399d305426e614eae51dd03fb48fa86c4d324,http://www.wikidata.org/entity/Q1148337,hepatocellular carcinoma,http://www.wikidata.org/entity/Q16983356,Human coronavirus 229E
1,http://ns.inria.fr/covid19/c6a72d0b6a1510ac7ecb3e16179d9784990f6314,http://www.wikidata.org/entity/Q1148337,hepatocellular carcinoma,http://www.wikidata.org/entity/Q278567,severe acute respiratory syndrome-related coronavirus
2,http://ns.inria.fr/covid19/7b620ef79c42118f157be2d9ab586fbd615b9fc6,http://www.wikidata.org/entity/Q1148337,hepatocellular carcinoma,http://www.wikidata.org/entity/Q278567,severe acute respiratory syndrome-related coronavirus
3,http://ns.inria.fr/covid19/3dd399d305426e614eae51dd03fb48fa86c4d324,http://www.wikidata.org/entity/Q1148337,hepatocellular carcinoma,http://www.wikidata.org/entity/Q290805,Coronavirus
4,http://ns.inria.fr/covid19/c1e82ab3646382fd09d4e38c7041f89466f5028d,http://www.wikidata.org/entity/Q1148337,hepatocellular carcinoma,http://www.wikidata.org/entity/Q290805,Coronavirus
5,http://ns.inria.fr/covid19/ddba9808c2e5a41e0a27996ff5b59e4c09ae159a,http://www.wikidata.org/entity/Q1148337,hepatocellular carcinoma,http://www.wikidata.org/entity/Q290805,Coronavirus
6,http://ns.inria.fr/covid19/c1e82ab3646382fd09d4e38c7041f89466f5028d,http://www.wikidata.org/entity/Q1148337,hepatocellular carcinoma,http://www.wikidata.org/entity/Q6926073,Mouse hepatitis virus
7,http://ns.inria.fr/covid19/bdcac9db79352158e6d350d9fe430e5be20c22d6,http://www.wikidata.org/entity/Q12078,cancer,http://www.wikidata.org/entity/Q1134583,Coronaviridae
8,http://ns.inria.fr/covid19/bdcac9db79352158e6d350d9fe430e5be20c22d6,http://www.wikidata.org/entity/Q12078,cancer,http://www.wikidata.org/entity/Q290805,Coronavirus
9,http://ns.inria.fr/covid19/6b235789ef1785a5fbe39b0958ef4440eb482f54,http://www.wikidata.org/entity/Q12078,cancer,http://www.wikidata.org/entity/Q290805,Coronavirus


In [444]:
import matplotlib.pyplot as plt
import seaborn as sns

# Count the result rows (one per article) that pair each cancer-related
# label (dis1Label) with each coronavirus-related label (dis2Label)
cooccurrence_matrix = pd.crosstab(df['dis1Label'], df['dis2Label'])

figure = plt.figure(figsize=(18, 18))

# Hide the cells where the two labels never co-occur
mask = cooccurrence_matrix == 0

ax = sns.heatmap(cooccurrence_matrix, cmap='Greens', annot=True, fmt='d', square=True, mask=mask)

ax.xaxis.tick_top() # x axis on top
plt.xticks(rotation=90)
ax.xaxis.set_label_position('top')
plt.show()

In [468]:
pd.DataFrame(df.groupby(['dis1Label', 'dis2Label']).size().reset_index()).columns

Index(['dis1Label', 'dis2Label', 0], dtype='object')
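
The `groupby(...).size()` call above counts how many rows share the same pair of labels. The same counting can be sketched in plain Python with `collections.Counter`; the label pairs below are illustrative and only mirror the shape of the `dis1Label` and `dis2Label` columns.

```python
from collections import Counter

# Illustrative (dis1Label, dis2Label) pairs, not actual query results
pairs = [
    ("hepatocellular carcinoma", "Coronavirus"),
    ("hepatocellular carcinoma", "Coronavirus"),
    ("cancer", "Coronavirus"),
    ("cancer", "Coronaviridae"),
]

# Each distinct pair becomes a key; the value is its number of occurrences
freq = Counter(pairs)
for (dis1, dis2), n in freq.most_common():
    print(dis1, "|", dis2, "|", n)
```

With the real DataFrame, the pairs could be obtained with `zip(df['dis1Label'], df['dis2Label'])`.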

In [472]:
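# Wrapping the GroupBy object directly in pd.DataFrame yields (key, group) pairs,
# which raises KeyError: 'dis1Label' on pivot; count the rows per label pair first.
df.groupby(['dis1Label', 'dis2Label']).size().reset_index().pivot(index='dis1Label', columns='dis2Label', values=0).fillna(0)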