# Analysis of Harvested Bioschemas Data

__Authors:__  
Alasdair J G Gray ([ORCID:0000-0002-5711-4872](http://orcid.org/0000-0002-5711-4872)), _Heriot-Watt University, Edinburgh, UK_

Imran Asif ([ORCID:0000-0002-1144-6265](https://orcid.org/0000-0002-1144-6265)), _Heriot-Watt University, Edinburgh, UK_

Alban Gaignard


__License:__ Apache 2.0

__Acknowledgements:__ This work was funded as part of the [ELIXIR Interoperability Platform](https://elixir-europe.org/platforms/interoperability) Strategic Implementation Study [Exploiting Bioschemas Markup to Support ELIXIR Communities](https://elixir-europe.org/about-us/commissioned-services/exploiting-bioschemas-markup-support-elixir-communities). This notebook has been adapted from the [notebook](https://github.com/AlasdairGray/IDP-KG/commit/2290f61b38cac40a2397ac0766a536f5ca27223e) developed for the Intrinsically Disordered Protein Knowledge Graph ([IDP-KG](https://alasdairgray.github.io/IDP-KG/)).

## Introduction

In this notebook we supply a user interface to query the harvested Bioschemas markup stored in an [online triplestore](https://swel.macs.hw.ac.uk/data/) in the `bioschemas` repository. Details of the data harvested and loaded into the triplestore can be found [here](https://github.com/BioSchemas/bioschemas-data-harvesting/).

The available queries can be grouped as discussed in the following subsections.

### HCLS Dataset Statistics

The W3C Health Care and Life Sciences Interest Group [guidelines for dataset descriptions](https://www.w3.org/TR/hcls-dataset/) include a set of queries that compute useful statistics for understanding a dataset. The queries range from basic statistics, such as the number of triples, classes, and objects, to more complex ones, such as the number of typed objects linked to a property.

The queries are hosted in the [HCLS-Stats-Queries repository](https://github.com/AlasdairGray/HCLS-Stats-Queries). The following are available through the interface below:
1. Number of triples
1. Number of unique typed entities
1. Number of unique subjects
1. Number of unique properties
1. Number of unique objects
1. Number of unique classes
1. Number of unique literals
1. Number of graphs
1. Number of instances per class
1. Number of occurrences of each property
1. Number of unique typed subjects and triples linked with each property
1. Number of unique typed objects linked with each property
1. Triples and number of unique literals related with each property
1. Number of unique subject types that are linked to unique object types
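
To make a few of these statistics concrete, the following minimal sketch computes them over a handful of hypothetical triples held in a plain Python list. The data, the prefixed names, and the function names are illustrative assumptions. The notebook itself obtains the statistics by running the HCLS SPARQL queries against the triplestore, and the exact definitions used by those queries may differ in detail.

```python
from collections import Counter

RDF_TYPE = "rdf:type"

# Hypothetical toy data: (subject, property, object) triples.
triples = [
    ("ex:p1", RDF_TYPE, "schema:Protein"),
    ("ex:p1", "schema:name", '"Protein one"'),
    ("ex:p2", RDF_TYPE, "schema:Protein"),
    ("ex:d1", RDF_TYPE, "schema:Dataset"),
    ("ex:d1", "schema:about", "ex:p1"),
]


def basic_statistics(triples):
    """Return a few of the simpler counts for a list of triples."""
    return {
        "triples": len(triples),
        "subjects": len({s for s, _, _ in triples}),
        "properties": len({p for _, p, _ in triples}),
        # A class is anything that appears as the object of rdf:type.
        "classes": len({o for _, p, o in triples if p == RDF_TYPE}),
    }


def instances_per_class(triples):
    """Count how many rdf:type statements point at each class."""
    return Counter(o for _, p, o in triples if p == RDF_TYPE)


stats = basic_statistics(triples)
per_class = instances_per_class(triples)
print(stats)
print(per_class.most_common())
```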

### Bioschemas Analysis Queries
The HCLS statistics provide generic dataset statistics. We will now focus on information about the data content.

15. Number of instances per Bioschemas type: Restricts the more generic HCLS query to the types of interest to the Bioschemas community
1. Number of domains: Return the number of web domains that have been harvested.
1. Number of pages per domain: Return the number of harvested pages per web domain.
1. Number of DataFeeds: Return the number of domains that were loaded through a DataFeed.
1. Number of instances per class per domain: Return the number of instances of each class per web domain.
1. Count of nodes with only incoming edges: Count the number of nodes that only have incoming edges.
1. Count of nodes with a single incoming edge AND no outgoing edge: This query has a large number of results, which cannot be streamed back from the server to the notebook, so it has been limited to 1,000 results.
1. Count outgoing links (return top 1,000): Return the 1,000 classes with the most outgoing edges.
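
Two of these analyses can be illustrated outside SPARQL with a small sketch. It assumes the harvested pages are available as plain URLs and the graph as a list of directed edges. Both inputs and both function names are hypothetical, and the actual queries in the repository may identify domains and nodes differently.

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical harvested page URLs.
pages = [
    "https://example.org/protein/1",
    "https://example.org/protein/2",
    "https://data.example.com/dataset/a",
]


def pages_per_domain(page_urls):
    """Count pages grouped by the host part of each URL."""
    return Counter(urlparse(url).netloc for url in page_urls)


# Hypothetical directed edges as (source node, target node) pairs.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("d", "a")]


def only_incoming(edges):
    """Return nodes that are the target of some edge but the source of none."""
    sources = {source for source, _ in edges}
    targets = {target for _, target in edges}
    return targets - sources


domain_counts = pages_per_domain(pages)
sinks = only_incoming(edges)
print(domain_counts)
print(sinks)
```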

## Query Execution

The cells in this section set up the infrastructure for querying the external SPARQL endpoint. Queries are read from the [HCLS-Stats-Queries repository](https://github.com/AlasdairGray/HCLS-Stats-Queries) and the [query directory](https://github.com/BioSchemas/bioschemas-data-harvesting/tree/main/queries) of the harvesting repository.

The notebook then presents a dropdown menu from which you select the query to execute. The option 'All Queries' executes each query in turn, displaying the answers in separate blocks below.
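
As a rough illustration of what querying a SPARQL endpoint involves at the protocol level, the sketch below builds (but does not send) an HTTP POST request carrying a query as a URL-encoded `query` parameter, as described by the SPARQL 1.1 Protocol. The notebook itself delegates this to `SPARQLWrapper`. The helper name `build_sparql_request` is hypothetical, and the endpoint URL is the one configured later in this notebook.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Endpoint used later in this notebook.
ENDPOINT = "https://swel.macs.hw.ac.uk/data/repositories/bioschemas"


def build_sparql_request(endpoint, query):
    """Build a POST request for a SPARQL query without sending it."""
    body = urlencode({"query": query}).encode("utf-8")
    return Request(
        endpoint,
        data=body,  # supplying a body makes urllib use POST
        headers={
            "Accept": "application/sparql-results+json",
            "Content-Type": "application/x-www-form-urlencoded",
        },
    )


request = build_sparql_request(
    ENDPOINT, "SELECT (COUNT(*) AS ?triples) WHERE { ?s ?p ?o }"
)
print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen` would execute it; that step is left out here so that the sketch runs without network access.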

### Library Imports

The notebook uses a number of libraries to execute the queries and display the results.

In [None]:
# Import and configure logging library
from datetime import datetime
import logging
logging.basicConfig(
    filename='idpQuery.log', 
    filemode='w', 
    format='%(levelname)s:%(message)s', 
    level=logging.INFO)
logging.info('Starting processing at %s' % datetime.now().time())

In [None]:
# Imports for the UI
import ipywidgets as widgets
from ipywidgets import Layout
from IPython.display import display, HTML, Javascript, clear_output
import clipboard
from SPARQLWrapper import SPARQLWrapper, JSON, XML
import json
import glob
import html

The following cell imports the SSL library so that we can disable the verification of the certificate on the SPARQL server. This is to get around a bug that stopped the notebook from working in late January 2022. Ideally we would not need to do this.
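
A narrower alternative, shown here only as a sketch, is to build a dedicated unverified context and pass it to individual requests instead of replacing the process-wide default. The example uses the standard library's `urllib` rather than `SPARQLWrapper`, so whether it can be applied to the notebook's own calls is an open assumption. Both helper functions are hypothetical names.

```python
import ssl
from urllib.request import urlopen


def unverified_context():
    """Build an SSL context that skips certificate and hostname checks."""
    context = ssl.create_default_context()
    # check_hostname must be disabled before verify_mode can be CERT_NONE.
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    return context


def fetch_without_verification(url):
    """Fetch a URL with verification disabled for this call only."""
    with urlopen(url, context=unverified_context()) as response:
        return response.read()


# No request is made here; we only compare the two contexts.
insecure = unverified_context()
default = ssl.create_default_context()
print(insecure.verify_mode, default.verify_mode)
```

With this approach, any other HTTPS connection made through the default context continues to verify certificates.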

In [None]:
# Import SSL and disable checking of certificate
import ssl
ssl._create_default_https_context = ssl._create_unverified_context

In [None]:
# Imports from RDFlib
import rdflib
from rdflib import ConjunctiveGraph, plugin
from rdflib.serializer import Serializer

In [None]:
%%html
<!-- Reading in external style sheet -->
<link rel="stylesheet" href="css/custom.css">

### Result Display Function

The following function takes the results of a `SPARQL SELECT` query and displays them using an HTML table for human viewing.
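
For orientation, the sketch below shows the shape of a result in the SPARQL 1.1 Query Results JSON format and a stripped-down conversion to a Markdown table. The sample data and the function name `to_markdown` are hypothetical. It is a simplified illustration of the idea, not the function used by this notebook.

```python
# Hypothetical result in the SPARQL 1.1 Query Results JSON format.
sample = {
    "head": {"vars": ["class", "count"]},
    "results": {"bindings": [
        {"class": {"type": "uri", "value": "https://schema.org/Dataset"},
         "count": {"type": "literal", "value": "1200"}},
        # "count" is unbound in this row, so the key is absent.
        {"class": {"type": "uri", "value": "https://schema.org/Protein"}},
    ]},
}


def to_markdown(result):
    """Render a SELECT result as a Markdown table."""
    variables = result["head"]["vars"]
    lines = [
        "| " + " | ".join(variables) + " |",
        "| " + " | ".join("---" for _ in variables) + " |",
    ]
    for row in result["results"]["bindings"]:
        cells = []
        for name in variables:
            value = row[name]["value"] if name in row else "N/A"
            # Add thousands separators to whole numbers.
            cells.append(f"{int(value):,}" if value.isdecimal() else value)
        lines.append("| " + " | ".join(cells) + " |")
    return "\n".join(lines)


table = to_markdown(sample)
print(table)
```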

In [None]:
md_format = ''
def displayResults(queryResult):
    HTMLResult = '<div class="full-width"><p>Number of results: ' + str(len(queryResult['results']['bindings'])) + '</p>'
    HTMLResult = HTMLResult + '<table class="table"><tr>'
    
    global md_format
    md_format = ''
    # print variable names and build header:
    for varName in queryResult['head']['vars']:
        HTMLResult = HTMLResult + '<th>' + varName + '</th>'
        md_format = md_format + ' | ' + varName
    HTMLResult = HTMLResult + '</tr>'
    
    md_format = md_format + ' | \n'
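    # Define the column alignment from the first row:
    # numeric values are right-aligned, everything else is left-aligned
    for row in queryResult['results']['bindings']:
        for column in queryResult['head']['vars']:
            # Variables that are unbound in a row are absent from its bindings
            strVal = str(row[column]['value']) if column in row else ''
            if not strVal.isnumeric():
                md_format = md_format + ' | :---'
            else:
                md_format = md_format + ' | ---:'
        break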

    md_format = md_format + ' | \n'
    # print values from each row and build table of results
    for row in queryResult['results']['bindings']:
        HTMLResult = HTMLResult + '<tr>' 
        for column in queryResult['head']['vars']:
            # Variables that are unbound in a row are absent from its bindings
            if column in row:
                strVal = str(row[column]['value'])
                align = "" if not strVal.isnumeric() else "text-align: right;padding-right:1%;"
                value = strVal if not strVal.isnumeric() else "{:,}".format(int(strVal)) 
                HTMLResult = HTMLResult + '<td style="'+align+'">' +  value + '</td>'
                md_format = md_format + ' | ' + value
            else:
                HTMLResult = HTMLResult + '<td>' + "N/A"+ '</td>'
                md_format = md_format + ' | N/A'
        HTMLResult = HTMLResult + '</tr>'
        md_format = md_format + ' | \n'
        
    HTMLResult = HTMLResult + '</table></div>'
    return HTMLResult
    #display(HTML(HTMLResult))

### Loading and Querying IDP-KG

The following code configures the SPARQL endpoint to use. It also orders the queries that can be chosen, grouping them into those that correspond to the HCLS statistics queries and those that are Bioschemas analysis queries.
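
The dropdown labels are taken from a leading `#` comment in each query file. The sketch below shows one way such a title could be extracted. `query_title` is a hypothetical helper, and it is slightly stricter than the notebook's own logic in that it requires the first line to start with `#` rather than merely contain one.

```python
def query_title(query_text):
    """Return the text of a leading '#' comment line, or '' if there is none."""
    first_line = query_text.partition("\n")[0].lstrip()
    if first_line.startswith("#"):
        return first_line.lstrip("#").strip()
    return ""


example_query = (
    "# Number of triples\n"
    "SELECT (COUNT(*) AS ?triples) WHERE { ?s ?p ?o }"
)
title = query_title(example_query)
print(title)
```

Requiring the `#` at the start of the line avoids treating a first line such as a `PREFIX` declaration whose IRI contains `#` as a title.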

In [None]:
idpKG = None
opt = ''  #selection option
query_options = [] # use in dropdown list

queryOrder = {  # All queries must be entered here.
    1: 'HCLS-Stats-Queries/number-triples.rq',
    2: 'HCLS-Stats-Queries/typed-entities.rq',
    3: 'HCLS-Stats-Queries/number-subjects.rq',
    4: 'HCLS-Stats-Queries/number-properties.rq',
    5: 'HCLS-Stats-Queries/number-objects.rq',
    6: 'HCLS-Stats-Queries/number-classes.rq',
    7: 'HCLS-Stats-Queries/number-literals.rq',
    8: 'HCLS-Stats-Queries/number-graphs.rq',
    9: 'HCLS-Stats-Queries/class-count.rq',
    10: 'HCLS-Stats-Queries/properties-ccurence.rq',
    11: 'HCLS-Stats-Queries/property-subjects-triples.rq',
    12: 'HCLS-Stats-Queries/number-typed-objects-linked-property.rq',
    13: 'HCLS-Stats-Queries/triples-literals-related-property.rq',
    14: 'HCLS-Stats-Queries/number-subject-types-object-types.rq',
    15: 'queries/instancesPerBioschemasType.rq',
    16: 'queries/numberOfDomains.rq',
    17: 'queries/pagesPerDomain.rq',
    18: 'queries/numberDomainFeeds.rq',
    19: 'queries/countTypeInstancePerDomain.rq',
    20: 'queries/countNodesOnlyIncomingEdges.rq',
    21: 'queries/countSingleIncomingEdge.rq',
    22: 'queries/countOutgoingLinks.rq'
}

# Sort by key so that the queries are listed and executed in numeric order
queryOrder = {k: queryOrder[k] for k in sorted(queryOrder)}

query_options.append(('-- Select Query --', 'select'))
query_options.append(('All Queries', 'all'))

for key in queryOrder:
    text = '' #queryOrder[key].split("/")[1]
    with open(queryOrder[key]) as f:
        query = f.read()
        if '#' in query.partition('\n')[0]:
            text = query.partition('\n')[0].replace('#','').strip()
            
    query_options.append((text, queryOrder[key]))
    
# Set the global graph variable to a SPARQL endpoint wrapper or a local in-memory graph
def set_variable(loadingOpt, endpoint):
    global idpKG
    if loadingOpt == 'sparql':
        idpKG = SPARQLWrapper(endpoint)
        idpKG.method = 'POST'
        idpKG.setReturnFormat(JSON)
        logging.info("SPARQL Endpoint: %s" % endpoint)
    else:
        idpKG = ConjunctiveGraph()
        idpKG.parse(endpoint, format="nquads")
        #idpKG.serialize(format="json-ld") 
        logging.info("\tIDP-KG has %s statements." % len(idpKG))

def query_idpkg(query, loadingOpt):
    if loadingOpt == 'sparql':
        idpKG.setQuery(query)
        results = idpKG.queryAndConvert()
        logging.info("Number of Results: %s" % str(len(results['results']['bindings'])))
        return results
    else:
        results = idpKG.query(query)
        results = json.loads(results.serialize(format="json"))
        logging.info("Number of Results: %s" % str(len(results['results']['bindings'])))
        return results

def copied_md_format(e):
    clipboard.copy(md_format)
    print('copied')

def display_query_results(query, results, query_title, queryFile):
    btnCopy = widgets.Button(
        description='Copy .md',
        disabled=False,
        button_style='', # 'success', 'info', 'warning', 'danger' or ''
        icon='download',
        layout=Layout(left='79%')
    )
    btnCopy.on_click(copied_md_format)
    accordion = widgets.Accordion(children=[widgets.HTML(value=query), widgets.VBox([btnCopy, widgets.HTML(value=results)])], 
                              selected_index=1,
                              layout=Layout(width='97%'))
    accordion.set_title(0, 'Query: ' + query_title + ' ('+'File: ' + queryFile+')')
    accordion.set_title(1, 'Results')
    display(accordion)

query_editor = widgets.Textarea(
    value = '',
    disabled = False,
    rows = 10,
    layout = Layout(width = '90%')
)

def getQueryText(queryFile):
    with open(queryFile) as f:
        query = f.read()
        query_editor.value = query
    
def runQuery(query, queryFile):
    first_line = ''
    
    if query == 'all':
        with open(queryFile) as f:
            query = f.read()
            first_line = query.partition('\n')[0]
        
            if '#' in first_line:
                query = query.split("\n",1)[1]
            else:
                first_line = ''
            
    display(HTML('<hr />'))
    # Cap long queries at 40% of the viewport height; shorter ones size to their content
    height = '40vh'
    if len(html.escape(query).splitlines()) < 40:
        height = 'auto'
    display_query = '<div style="height:'+height+';overflow-x:auto;overflow-y:auto;"><pre>'+html.escape(query)+'</pre></div>'
    logging.info('Executing query in file: ' + queryFile)
    logging.debug('Query:\n' + query)
    try:
        HTMLResult = displayResults(query_idpkg(query, opt))
        display_query_results(display_query, HTMLResult, first_line.replace('#',''), queryFile)
    except Exception as e:
        logging.error('Exception running query: ' + str(e) + '\nQuery:\n ' + query + '\noptions: ' + opt)
        display_query_results(display_query, 'Exception: ' + str(e), first_line.replace('#',''), queryFile)
            
##########################################################################################################
#Create Selection GUI

dropdown = widgets.Dropdown(
            options=query_options,
            value='select'
        )

output = widgets.Output()

def on_change(change):
    if change['type'] == 'change' and change['name'] == 'value':
        hideMessage()
        output.clear_output()
        if 'select' != change['new']:
            if 'all' != change['new']:
                query_editor.layout.display = ''
                getQueryText(change['new'])
            else:
                query_editor.value = ''
                query_editor.layout.display = 'none'
        else:
            query_editor.value = ''
            
dropdown.observe(on_change)

msgHTML = widgets.HTML(value = '')
    
btn = widgets.Button(
    description='Execute',
    disabled=False,
    button_style='', # 'success', 'info', 'warning', 'danger' or ''
    tooltip='Execute',
    icon='check'
)

def showMessage(msg):
    msgHTML.value = f'''<div class="alert">
                          <strong>Error:</strong> {msg}
                     </div>'''
def hideMessage():
    msgHTML.value = ''

def loading():
    btn.icon = 'hourglass-1'
    btn.description = 'Busy'
    btn.disabled = True
    
def endLoading():
    btn.icon = 'check'
    btn.description = 'Execute'
    btn.disabled = False

def createSelectionGUI():
    btn.on_click(on_button_clicked)
    display(widgets.VBox([dropdown, query_editor]), widgets.VBox([btn, msgHTML]), output)

def on_button_clicked(e):
    loading()
    hideMessage()
    with output:
        global opt
        set_variable('sparql', 'https://swel.macs.hw.ac.uk/data/repositories/bioschemas')
        opt = 'sparql'
            
        clear_output(True)
        
        #Execute query
        
        if opt != '':
            logging.debug('Dropdown selection: %s' % dropdown.value)
            logging.debug('Query: %s' % query_editor.value)
            if dropdown.value == 'all':
                for key in queryOrder:
                    logging.info('Run query %d' % key)
                    runQuery('all', queryOrder[key])
            else:
                if query_editor.value != '':
                    logging.info('Run query %s' % dropdown.value)
                    sparql_query = query_editor.value
                    runQuery(sparql_query, dropdown.value)
                else:
                    showMessage('There is no query to execute. Please select a query from the dropdown menu to execute.')
        
        endLoading()

## Query Interface

The following cell generates the UI and allows you to select the query to run.

The results of the query execution are displayed below. The query box can be expanded to show the text of the query. The results box is scrollable.

Descriptions of the first 14 queries can be found in the [HCLS Dataset Description Community Profile](https://www.w3.org/TR/hcls-dataset/#s6_6). The other queries are explained [above](#Bioschemas-Analysis-Queries).

In [None]:
createSelectionGUI()