If you want to run this notebook in Google Colab, run the following cell. Otherwise follow [installation instructions](https://github.com/BlueBrain/BlueGraph/blob/master/README.rst#installation) to install BlueGraph and its dependencies locally.

In [None]:
# Install bluegraph
! git clone https://github.com/BlueBrain/BlueGraph
! cd BlueGraph && pip install .[cord19kg]

# Install graph-tool
!echo "deb http://downloads.skewed.de/apt bionic main" >> /etc/apt/sources.list
!apt-key adv --keyserver keys.openpgp.org --recv-key 612DEFB798507F25
!apt-get update
!apt-get install python3-graph-tool=2.37 python3-cairo python3-matplotlib

DATA_PATH = "BlueGraph/cord19kg/examples/data/"

# Topic-centered co-occurrence network analysis of CORD-19

In this notebook we will perform interactive exploration and analysis of a topic-centered subset of the [CORD-19](https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge) dataset using the `cord19kg` package. The exploration and analysis techniques presented here focus on named entities and their co-occurrence in the scientific articles constituting the dataset.

The input data for this notebook contains the named entities extracted from the 3000 most relevant articles to the query _"Glucose is a risk factor of COVID-19"_ obtained using the article search model [BlueSearch](https://github.com/BlueBrain/Search). The entity extraction was performed using the Named Entity Recognition (NER) techniques also included in [BlueSearch](https://github.com/BlueBrain/Search). The entities represent 10 different types (i.e. proteins, chemicals, drugs, diseases, conditions, organs, organisms, pathways, cell types, cell compartments). 

The interactive literature exploration through named entity co-occurrence analysis consists of the following steps:

1. __Data preparation__ step converts raw mentions into aggregated entity occurrence statistics.
2. __Data curation__ step allows the user to manage extracted entities: modify, filter them and link to the ontology.
3. __Network generation__ step allows creating entity co-occurrence networks based on paper-, section- and paragraph-level co-occurrence relations between entities. These entity relations are quantified using mutual-information-based scores (pointwise mutual information and its normalized version).
4. __Network visualization and analysis__ step allows the user to visualize the network interactively, edit its elements and analyze it (spanning tree, mutual-information-based shortest paths between entities, etc.).

In [1]:
import json
import os
import zipfile

import pandas as pd

import dash_cytoscape as cyto

from kgforge.core import KnowledgeGraphForge

from cord19kg.utils import (generate_curation_table,
                           link_ontology,
                           generate_cooccurrence_analysis,
                           download_from_nexus)
from cord19kg.apps.curation_app import curation_app
from cord19kg.apps.visualization_app import visualization_app

In [2]:
try:
    print(f"Data path: '{DATA_PATH}'")
except NameError:
    DATA_PATH = "../data/"
    print(f"Data path: '{DATA_PATH}'")

Data path: '../data/'


In [3]:
cyto.load_extra_layouts()

## 1. Data preparation

The input dataset contains occurrences of different terms in paragraphs of scientific articles from the CORD-19 dataset previously extracted by means of a NER model.
The dataset is stored in Blue Brain Nexus.

In [4]:
# Blue Brain Nexus bucket to download data from

nexus_bucket = "covid19-kg/data"
nexus_endpoint = "https://bbp.epfl.ch/nexus/v1"
nexus_config_file = f"{DATA_PATH}../config/data-download-nexus.yml"

In [5]:
%%time
download_from_nexus(
    uri=f"{nexus_endpoint}/resources/{nexus_bucket}/_/1e01e1a2-133f-4833-9fe0-93230384b95f",
    output_path=DATA_PATH, config_file_path=nexus_config_file,
    nexus_endpoint=nexus_endpoint, nexus_bucket=nexus_bucket, unzip=True)

data = pd.read_csv(f"{DATA_PATH}/Glucose_risk_3000_papers.csv")
print("Done.")

Downloading the file to '../data/Glucose_risk_3000_papers.csv.zip'
Decompressing ...
Done.
CPU times: user 4.27 s, sys: 478 ms, total: 4.75 s
Wall time: 7.78 s


In [6]:
data.sample(5)

Unnamed: 0,entity,entity_type,occurrence
1829936,adenosine,CHEMICAL,21655:Caption:1532
1327062,inflammation,DISEASE,14095:Conclusions::251
901113,hy-povolemic shock,DISEASE,13218:E-19:823
2160250,malaria,DISEASE,22186:1P39:123
185396,anxiety,DISEASE,5150:Results:1854


In the first preparation step, we group and aggregate the input data by unique entities.

In [7]:
%%time
print("Preparing curation data...")
curation_input_table, factor_counts = generate_curation_table(data)
print("Done.")

Preparing curation data...
Cleaning up the entities...
Aggregating occurrences of entities....
Done.
CPU times: user 30.1 s, sys: 1.11 s, total: 31.2 s
Wall time: 31.5 s


The resulting dataframe contains a row per unique named entity together with the following occurrence data: 
- sets of unique paragraphs, papers, sections, where the corresponding entity is mentioned (`paper`, `section`, `paragraph` columns);
- number of total entity occurrences (the `raw_frequency` column);
- number of unique papers where it occurs (the `paper_frequency` column);
- unique entity types assigned by the NER model (the `entity_type` column, multiple types are possible);
- raw entity types assigned by the NER model with the multiplicity of their occurrence (the `raw_entity_types` column).
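
The aggregation described above can be sketched in plain Python. This is only an illustration, not the implementation of `generate_curation_table`: the sample rows are made up, and the sketch assumes that entities are lower-cased before grouping and that an occurrence string has the form `paper:section:paragraph`, with the paper id first and the paragraph number last (as the sample rows above suggest).

```python
from collections import defaultdict

# Hypothetical rows in the (entity, entity_type, occurrence) layout shown above.
rows = [
    ("Glucose", "CHEMICAL", "1:Introduction:10"),
    ("glucose", "CHEMICAL", "1:Introduction:10"),
    ("glucose", "DRUG", "2:Results:35"),
    ("inflammation", "DISEASE", "2:Conclusions::251"),
]


def split_occurrence(occurrence):
    """Split 'paper:section:paragraph' into (paper, 'paper:section', occurrence).

    Assumes the paper id is the first field and the paragraph number the last,
    so that section titles containing ':' stay intact.
    """
    paper, rest = occurrence.split(":", 1)
    section, _paragraph = rest.rsplit(":", 1)
    return paper, f"{paper}:{section}", occurrence


def aggregate(rows):
    """Group raw mentions by lower-cased entity and collect occurrence statistics."""
    table = defaultdict(lambda: {
        "paper": set(), "section": set(), "paragraph": set(),
        "raw_entity_types": [], "raw_frequency": 0})
    for entity, entity_type, occurrence in rows:
        paper, section, paragraph = split_occurrence(occurrence)
        record = table[entity.lower()]
        record["paper"].add(paper)
        record["section"].add(section)
        record["paragraph"].add(paragraph)
        record["raw_entity_types"].append(entity_type)
        record["raw_frequency"] += 1
    for record in table.values():
        record["paper_frequency"] = len(record["paper"])
        record["entity_type"] = sorted(set(record["raw_entity_types"]))
    return dict(table)


stats = aggregate(rows)
print(stats["glucose"]["raw_frequency"], stats["glucose"]["paper_frequency"])
```

In this toy input `glucose` is mentioned three times across two papers, so its raw frequency and its paper frequency differ.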


In [8]:
curation_input_table.sample(5)

Unnamed: 0,entity,entity_type,paragraph,paper,section,paper_frequency,raw_entity_types,raw_frequency
19820,communicable disease,DISEASE,"[22186:Caption:1746, 13091:University Of Wisco...","[222389, 5113, 28602, 210011, 7793, 211843, 21...","[215630:Ethical Considerations, 200371:Establi...",27,"[DISEASE, DISEASE, DISEASE, DISEASE, DISEASE, ...",33
24523,discontinuation,PATHWAY,[21768:Biomarkers Of Small Fiber Neuropathy In...,"[14031, 13659, 14685, 222405, 21768]",[21768:Biomarkers Of Small Fiber Neuropathy In...,5,"[PATHWAY, PATHWAY, PATHWAY, PATHWAY, PATHWAY]",5
56991,neural dysfunction,DISEASE,"[6254:Main:34, 6254:Main:32, 21673:P15.34:407]","[21673, 6254]","[21673:P15.34, 6254:Main]",2,"[DISEASE, DISEASE, DISEASE]",3
77143,sickle cell,CELL_TYPE,[14095:O-148 Preconception Genetic Screening A...,"[23589, 14095]","[23589:Conclusion, 14095:O-148 Preconception G...",2,"[CELL_TYPE, CELL_TYPE]",2
48259,lipid abnormalities,DISEASE,"[208988:Lipids And Covid-19:26, 14091:781:555,...","[226019, 13279, 14091, 208988, 6689, 179121, 1...","[14044:Ag, 13279:Department Of Biochemistry, S...",9,"[DISEASE, DISEASE, DISEASE, DISEASE, DISEASE, ...",9


The second output of the data preparation step contains the counts of distinct occurrence factors: the number of distinct papers, sections and paragraphs in the input corpus.

In [9]:
factor_counts

{'paper': 3000, 'section': 53947, 'paragraph': 211380}

## 2. Data curation

### Loading the NCIT ontology linking data

To group synonymous entities in the previously extracted table (e.g. `ace2`, `ace-2`, `angiotensin-converting enzyme 2`), as well as to assign additional semantics to these entities (e.g. human-readable definition, taxonomy, etc.), we perform further _linking_ of the entities to the terms from the [NCIT ontology](https://ncithesaurus.nci.nih.gov/ncitbrowser/).

To perform such ontology linking, we load additional data pre-computed using ML-based linking models.

In [10]:
%%time
print("Loading the ontology linking data...")

download_from_nexus(
    uri=f"{nexus_endpoint}/resources/{nexus_bucket}/_/4fde1f8f-ee7f-435e-95a8-abb79139db93",
    output_path=DATA_PATH, config_file_path=nexus_config_file,
    nexus_endpoint=nexus_endpoint, nexus_bucket=nexus_bucket, unzip=True)
print("\tLoading the linking dataframe in memory...")
ontology_linking = pd.read_csv(f"{DATA_PATH}/NCIT_ontology_linking_3000_papers.csv")


print("\tLoading ontology type mapping...")
ontology_linking_type_mapping_data = download_from_nexus(
    uri=f"{nexus_endpoint}/resources/{nexus_bucket}/_/92bc2a04-6003-4f4d-85e1-dcc5f2352df2", 
    output_path=DATA_PATH, config_file_path=nexus_config_file,
    nexus_endpoint=nexus_endpoint, nexus_bucket=nexus_bucket)
with open(f"{DATA_PATH}/{ontology_linking_type_mapping_data.distribution.name}", "rb") as f:
    type_mapping = json.load(f)
print("Done.")

Loading the ontology linking data...
Downloading the file to '../data/NCIT_ontology_linking_3000_papers.csv.zip'
Decompressing ...
	Loading the linking dataframe in memory...
	Loading ontology type mapping...
Downloading the file to '../data/NCIT_type_mapping.json'
Done.
CPU times: user 1.81 s, sys: 263 ms, total: 2.08 s
Wall time: 6.28 s


The ontology linking table contains the following columns:
- `mention` entity mentioned in the text
- `concept` ontology concept linked to the entity mention
- `uid` unique identifier of the ontology concept
- `definition` definition of the concept
- `taxonomy` taxonomy of semantic types associated with the concept
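
To illustrate what linking achieves, here is a minimal sketch of merging the occurrence data of synonymous mentions under one concept. It is not the code of `link_ontology`; the dictionaries below are hypothetical stand-ins for the occurrence table and for the `mention` and `concept` columns of the linking table.

```python
from collections import defaultdict

# Hypothetical occurrence data: entity -> set of papers mentioning it.
occurrences = {
    "ace2": {"p1", "p2"},
    "ace-2": {"p2", "p3"},
    "angiotensin-converting enzyme 2": {"p4"},
    "glucose": {"p1"},
}

# Hypothetical linking data: mention -> concept.
linking = {
    "ace2": "angiotensin-converting enzyme 2",
    "ace-2": "angiotensin-converting enzyme 2",
    "angiotensin-converting enzyme 2": "angiotensin-converting enzyme 2",
}


def merge_by_concept(occurrences, linking):
    """Merge occurrence sets of all mentions linked to the same concept.

    Mentions without a link are kept under their own name.
    """
    papers = defaultdict(set)
    aggregated = defaultdict(list)
    for mention, mention_papers in occurrences.items():
        concept = linking.get(mention, mention)
        papers[concept] |= mention_papers
        aggregated[concept].append(mention)
    return {
        concept: {
            "paper": papers[concept],
            "aggregated_entities": sorted(aggregated[concept]),
            "paper_frequency": len(papers[concept]),
        }
        for concept in papers
    }


linked = merge_by_concept(occurrences, linking)
print(linked["angiotensin-converting enzyme 2"]["paper_frequency"])
```

Because the paper sets are united rather than their sizes summed, a paper mentioning both `ace2` and `ace-2` is counted once.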

In [11]:
ontology_linking.sample(5)

Unnamed: 0,mention,concept,uid,definition,taxonomy
72068,ppid,ppid,,,
16413,braf,braf,,,
75944,tumor-derived exosomes,tumor-derived,http://purl.obolibrary.org/obo/NCIT_C14150,,[('http://purl.obolibrary.org/obo/NCIT_C28101'...
42512,sbis,twice weekly,http://purl.obolibrary.org/obo/NCIT_C64497,Two times per week.,[('http://purl.obolibrary.org/obo/NCIT_C64493'...
147713,microvascular angina64,coronary microvascular disease,http://purl.obolibrary.org/obo/NCIT_C84478,A disorder affecting the smallest coronary art...,[('http://purl.obolibrary.org/obo/NCIT_C35741'...


### Interactive curation of entity occurrence data

The package provides an interactive entity curation app that allows the user to visualize the entity occurrence data, modify it, perform ontology linking (see the `Link to NCIT ontology` button) and filter out short or infrequent entities.

The field `Keep` allows specifying a set of entities that must be kept in the dataset at all times (even if they don't satisfy the selected filtering criteria).

Finally, the value specified in the `Generate Graphs from top N frequent entities` field is the number of top entities (ranked by the number of papers in which they occur) to be included in the co-occurrence network.
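
The interplay between the top-N selection and the `Keep` field can be sketched as follows. The function `select_entities` and the frequencies are hypothetical, and the sketch assumes that kept entities are added on top of the N most frequent ones; the app may handle this differently.

```python
def select_entities(paper_frequency, n, keep=()):
    """Return the `n` most frequent entities plus every entity listed in `keep`.

    `paper_frequency` maps an entity to the number of papers mentioning it.
    """
    # Sort by decreasing frequency; ties are broken alphabetically.
    ranked = sorted(paper_frequency, key=lambda e: (-paper_frequency[e], e))
    selected = ranked[:n]
    # Entities in `keep` are appended even when they fall outside the top `n`.
    selected += [e for e in keep if e in paper_frequency and e not in selected]
    return selected


# Hypothetical frequencies, for illustration only.
frequencies = {"covid-19": 900, "diabetes": 700, "insulin": 650, "glucose": 400}
selection = select_entities(frequencies, n=2, keep=["glucose", "covid-19"])
print(selection)
```

Here `glucose` is ranked fourth but still ends up in the selection because it is listed in `keep`.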

We load the prepared data table into the curation app as follows:

In [12]:
curation_app.set_table(curation_input_table.copy())

We can specify the default entities to keep.

In [13]:
default_entities_to_keep = ["glucose", "covid-19"]
curation_app.set_default_terms_to_include(default_entities_to_keep)

Finally, we set the ontology linking callback to be fired upon a click on the `Link to NCIT ontology` button.

In [14]:
curation_app.set_ontology_linking_callback(lambda x: link_ontology(ontology_linking, type_mapping, x))

### Launch the curation app

The application can be launched inline (inside the current notebook), as below. Note that if you run this notebook in Colab, you may want to set a lower number of entities to include (the `Generate Graphs from top N frequent entities` field) in order to avoid a long generation time. The current default value is 200.

In [15]:
curation_app.run(port=8074, mode="inline")

Merging the occurrence data with the ontology linking...


Alternatively, it can be opened externally: uncomment and execute the following cell, then Ctrl+Click on the displayed URL to open the app in a separate browser tab.

In [23]:
# curation_app.run(port=8070, mode="external")

## 3. Co-occurrence network generation

The curation table currently displayed in the curation app can be extracted using the `get_curated_table` method.

In [17]:
curated_occurrence_data = curation_app.get_curated_table()
curated_occurrence_data.sample(3)

Unnamed: 0_level_0,paper,section,paragraph,aggregated_entities,uid,definition,paper_frequency,entity_type
entity,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
cathepsin-cleavable nir substrate probe 6qc-nir,"{14524, 6400}","{14524:Caption, 6400:Conclusions, 6400:Introdu...","{6400:Introduction:770, 14524:Caption:2054, 64...","[ianire, nire]",http://purl.obolibrary.org/obo/NCIT_C167240,A cysteine cathepsin-cleavable near infrared (...,2,PROTEIN
chronic mental illness,{30878},"{30878:Patients With Mental Health Disorders ,...",{30878:Patients With Mental Health Disorders :...,[chronic mental illness],,,1,DISEASE
pineal gland,"{207712, 21644, 14045, 7067, 14046, 6402, 2165...","{21656:P-04.04.4-005, 14046:P 959 Intracranial...","{14045:0265:418, 21655:Pp7C-63 Melatonin Suppr...","[pineal, pineal gland]",http://purl.obolibrary.org/obo/NCIT_C12398,"A small endocrine gland in the brain, situated...",18,ORGAN


Before we can proceed, we need to convert the `paper`, `section` and `paragraph` columns into sets.

In [18]:
curated_occurrence_data["paper"] = curated_occurrence_data["paper"].apply(set)
curated_occurrence_data["paragraph"] = curated_occurrence_data["paragraph"].apply(set)
curated_occurrence_data["section"] = curated_occurrence_data["section"].apply(set)

We can also retrieve the current values of the `Keep` field (these entities will also be included in the resulting co-occurrence network).

In [19]:
curation_app.get_terms_to_include()

['glucose', 'covid-19']

### Generating co-occurrence networks

In the cell below we generate a paragraph-based entity co-occurrence network. Along with the network generation, the `generate_cooccurrence_analysis` function:

- computes node centrality metrics (such as degree, PageRank)
- computes co-occurrence statistics (such as frequency, pointwise mutual information and normalized pointwise mutual information) and assigns them as weights to the corresponding edges
- performs entity community detection based on different co-occurrence statistics
- computes mutual-information-based minimum spanning trees.
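
The co-occurrence statistics mentioned above follow the standard definitions: for entities $a$ and $b$, $\mathrm{PMI}(a, b) = \log \frac{p(a, b)}{p(a)\,p(b)}$ and $\mathrm{NPMI}(a, b) = \frac{\mathrm{PMI}(a, b)}{-\log p(a, b)}$, where probabilities are estimated as fractions of papers, sections or paragraphs. The sketch below is a minimal illustration of these formulas, not the library's code: the function `pmi_scores` and the sample sets are hypothetical, and the base-2 logarithm is an arbitrary choice (NPMI does not depend on the base).

```python
import math


def pmi_scores(occurrences_a, occurrences_b, total):
    """Return (frequency, PMI, NPMI) for two entities.

    `occurrences_a` and `occurrences_b` are sets of paper, section or paragraph
    ids; `total` is the number of such units in the corpus (see `factor_counts`).
    """
    frequency = len(occurrences_a & occurrences_b)
    if frequency == 0:
        return 0, None, None  # the entities never co-occur: no edge
    p_a = len(occurrences_a) / total
    p_b = len(occurrences_b) / total
    p_ab = frequency / total
    pmi = math.log2(p_ab / (p_a * p_b))
    # NPMI rescales PMI to [-1, 1]; the ratio is undefined when p_ab == 1,
    # in which case we use the limiting value 1.
    npmi = pmi / -math.log2(p_ab) if p_ab < 1 else 1.0
    return frequency, pmi, npmi


# Hypothetical paragraph ids, for illustration only.
glucose = {1, 2, 3, 4}
insulin = {3, 4, 5, 6}
frequency, pmi, npmi = pmi_scores(glucose, insulin, total=16)
print(frequency, pmi, round(npmi, 3))
```

A positive score means that the two entities co-occur more often than they would if they were mentioned independently of each other.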

Before we run the co-occurrence analysis, we create a dictionary with backend configurations for the analytics: we set metrics (centralities) computation to use `graph_tool`, community detection to use `networkx` and, finally, path search to use `graph_tool` as well.

In [20]:
import time

In [21]:
backend_configs = {
    "metrics": "graph_tool",
    "communities": "networkx",
    "paths": "graph_tool"
}

In [22]:
%%time
type_data = curated_occurrence_data[["entity_type"]].rename(columns={"entity_type": "type"})

graphs, trees = generate_cooccurrence_analysis(
    curated_occurrence_data,  factor_counts,
    n_most_frequent=curation_app.n_most_frequent,
    type_data=type_data, 
    factors=["paragraph"],
    keep=curation_app.get_terms_to_include(),
    cores=8,  # here set up the number of cores
    backend_configs=backend_configs)
print("Done.")

-------------------------------
Factor: paragraph
-------------------------------
Done.
CPU times: user 2.67 s, sys: 633 ms, total: 3.3 s
Wall time: 9.63 s


In [23]:
import pickle
with open("dump.pkl", "rb") as f:
    graphs, trees = pickle.load(f)

## 4. Network visualization and analysis

### Loading the generated graph into the visualization app

First of all, we set a backend for the visualization app (currently two backends are supported: based on `NetworkX` and `graph-tool`, in this example we use the latter).

In [24]:
visualization_app.set_backend("graph_tool")

In [25]:
# Run the following to use NetworkX as the backend for the visualization app
# visualization_app.set_backend("networkx")

In [26]:
visualization_app.add_graph(
    "Paragraph-based graph", graphs["paragraph"],
    tree=trees["paragraph"], default_top_n=100)

visualization_app.set_current_graph("Paragraph-based graph")
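
The `tree` passed to the app above is the mutual-information-based minimum spanning tree computed during network generation. As a rough sketch of the idea (not the backend implementation), the code below builds such a tree with Kruskal's algorithm. The NPMI scores are made up, and converting a score into a distance as `1 / NPMI` is an assumption made for this example.

```python
def minimum_spanning_tree(nodes, weighted_edges):
    """Kruskal's algorithm: return the edges of a minimum spanning forest.

    `weighted_edges` is an iterable of (distance, source, target) tuples.
    """
    parent = {node: node for node in nodes}

    def find(node):
        # Follow parent links to the component root, compressing the path.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    tree = []
    for distance, source, target in sorted(weighted_edges):
        root_s, root_t = find(source), find(target)
        if root_s != root_t:  # the edge joins two components, so no cycle
            parent[root_s] = root_t
            tree.append((source, target))
    return tree


# Hypothetical NPMI scores, for illustration only.
npmi = {
    ("glucose", "insulin"): 0.8,
    ("glucose", "diabetes"): 0.5,
    ("insulin", "diabetes"): 0.4,
    ("diabetes", "covid-19"): 0.2,
}
nodes = {node for edge in npmi for node in edge}
# Assumption: strongly associated entities should be close, so distance = 1 / NPMI.
edges = [(1 / score, s, t) for (s, t), score in npmi.items()]
tree = minimum_spanning_tree(nodes, edges)
print(tree)
```

The weakest of the three edges in the glucose, insulin, diabetes triangle is dropped, which leaves a tree that connects all four entities through their strongest associations.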

### Loading papers' meta-data into the app

We now load an additional dataset containing some meta-data on the papers where the entities analyzed in this notebook occur.

In [27]:
paper_metadata = download_from_nexus(
    uri=f"{nexus_endpoint}/resources/{nexus_bucket}/_/8fc1e60c-1ebe-4173-82c0-9775a4917041",
    output_path=DATA_PATH, config_file_path=nexus_config_file,
    nexus_endpoint=nexus_endpoint, nexus_bucket=nexus_bucket)
paper_data = pd.read_csv(f"{DATA_PATH}/{paper_metadata.distribution.name}")
paper_data = paper_data.set_index("id")
paper_data.head(3)

Downloading the file to '../data/Glucose_risk_3000_paper_meta_data.csv'


Unnamed: 0_level_0,title,authors,abstract,doi,url,journal,pmc_id,pubmed_id,publish_time
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
3,Surfactant protein-D and pulmonary host defense,"Crouch, Erika C",Surfactant protein-D (SP-D) participates in th...,10.1186/rr19,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5...,Respir Res,PMC59549,11667972.0,2000-08-25
56,CLINICAL VIGNETTES,,,10.1046/j.1525-1497.18.s1.20.x,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1...,J Gen Intern Med,PMC1494988,12753119.0,2003-04-01
58,Clinical Vignettes,,,10.1046/j.1525-1497.2001.0160s1023.x,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1...,J Gen Intern Med,PMC1495316,11357836.0,2001-04-01


We pass a callback for the lookup of paper meta-data to the visualization app using the `set_list_papers_callback` method.

In [28]:
def list_papers(paper_data, selected_papers, limit=200):
    """Return meta-data records for at most `limit` of the selected papers."""
    selected_paper_data = paper_data.loc[[int(p) for p in selected_papers]].head(limit)
    return selected_paper_data.to_dict("records")

visualization_app.set_list_papers_callback(lambda x: list_papers(paper_data, x))

The ontology linking process described above is noisy; therefore, we would like to retain access to the raw entities that were linked to particular ontology concepts. For this, we define the function `get_aggregated_entities`, which retrieves such raw entities, and pass it to the visualization app using the `set_aggregated_entities_callback` method.

In [29]:
def top_n(data_dict, n, smallest=False):
    """Return top `n` keys of the input dictionary by their value."""
    df = pd.DataFrame(dict(data_dict).items(), columns=["id", "value"])
    if smallest:
        df = df.nsmallest(n, columns=["value"])
    else:
        df = df.nlargest(n, columns=["value"])
    return(list(df["id"]))


def get_aggregated_entities(entity, n):
    if "aggregated_entities" in curated_occurrence_data.columns:
        aggregated = curated_occurrence_data.loc[entity]["aggregated_entities"]
    else:
        aggregated = [entity]
    if curation_input_table is not None:
        df = curation_input_table.set_index("entity")
        if entity in curated_occurrence_data.index:
            freqs = df.loc[aggregated]["paper_frequency"].to_dict()
        else:
            return {}
    else:
        df = data.copy()
        df["entity"] = data["entity"].apply(lambda x: x.lower())
        freqs = df[df["entity"].apply(lambda x: x.lower() in aggregated)].groupby("entity").aggregate(
            lambda x: len(x))["entity_type"].to_dict()
    if len(freqs) == 0:
        return {}
    return {e: freqs[e] for e in top_n(freqs, n)}

visualization_app.set_aggregated_entities_callback(
    lambda x: get_aggregated_entities(x, 10))

Finally, we create a dictionary `definitions` that will serve the visualization app as the lookup table for accessing the definitions of different ontology concepts.

In [30]:
definitions = ontology_linking[["concept", "definition"]].groupby(
    "concept").aggregate(lambda x: list(x)[0]).to_dict()["definition"]
visualization_app.set_entity_definitons(definitions)

### Launching the visualization app

As before, the interactive graph visualization app can be launched in two modes: inline and external. The following cell launches the app inline.

In [31]:
visualization_app.run(port=8081, mode="inline")

Alternatively, it can be opened externally: execute the following cell and Ctrl+Click on the displayed URL to open the app in a separate browser tab.

In [32]:
visualization_app.run(port=8081, mode="external")

Dash app running on http://127.0.0.1:8081/
