To install BlueGraph and the necessary dependencies from the notebook, run the following cell.

In [7]:
# Install bluegraph (the cloned repository also provides the example data used below)
! git clone https://github.com/BlueBrain/BlueGraph
! pip install "bluegraph[cord19kg]"

# Install graph-tool
!echo "deb http://downloads.skewed.de/apt bionic main" >> /etc/apt/sources.list
!apt-key adv --keyserver keys.openpgp.org --recv-key 612DEFB798507F25
!apt-get update
!apt-get install -y python3-graph-tool python3-cairo python3-matplotlib

Cloning into 'BlueGraph'...
remote: Enumerating objects: 238, done.
remote: Counting objects: 100% (238/238), done.
remote: Compressing objects: 100% (95/95), done.
remote: Total 2481 (delta 169), reused 159 (delta 140), pack-reused 2243
Receiving objects: 100% (2481/2481), 210.12 MiB | 29.32 MiB/s, done.
Resolving deltas: 100% (1474/1474), done.
Checking out files: 100% (214/214), done.
Executing: /tmp/apt-key-gpghome.o3abUwGOqh/gpg.1.sh --keyserver keys.openpgp.org --recv-key 612DEFB798507F25
gpg: key 612DEFB798507F25: "Tiago de Paula Peixoto <tiago@skewed.de>" not changed
gpg: Total number processed: 1
gpg:              unchanged: 1
Hit:1 https://cloud.r-project.org/bin/linux/ubuntu bionic-cran40/ InRelease
Hit:2 http://security.ubuntu.com/ubuntu bionic-security InRelease
Ign:3 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  InRelease
Ign:4 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64  InRelease
Hit:5 

# Topic-centered co-occurrence network analysis of CORD-19

In this notebook we will perform interactive exploration and analysis of a topic-centered subset of the [CORD-19](https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge) dataset using the `cord19kg` package. The exploration and analysis techniques presented here focus on named entities and their co-occurrence in the scientific articles constituting the dataset.

The input data for this notebook contains the named entities extracted from the 3000 articles most relevant to the query _"Glucose is a risk factor of COVID-19"_, obtained using the article search model [BlueSearch](https://github.com/BlueBrain/Search). The entity extraction was performed using the Named Entity Recognition (NER) techniques also included in [BlueSearch](https://github.com/BlueBrain/Search). The entities represent 10 different types (proteins, chemicals, drugs, diseases, conditions, organs, organisms, pathways, cell types, cell compartments).

The interactive literature exploration through named entity co-occurrence analysis consists of the following steps:

1. __Data preparation__ step converts raw mentions into aggregated entity occurrence statistics.
2. __Data curation__ step allows the user to manage extracted entities: modify, filter them and link to the ontology.
3. __Network generation__ step allows creating entity co-occurrence networks based on paper-, section- and paragraph-level co-occurrence relations between entities. These entity relations are quantified using mutual-information-based scores (pointwise mutual information and its normalized version).
4. __Network visualization and analysis__ step allows the user to perform interactive network visualization, edit network elements and perform its analysis (spanning tree, mutual-information based shortest paths between entities, etc).
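
The mutual-information scores mentioned in step 3 can be illustrated with a small, self-contained sketch. The function names and the toy counts are illustrative assumptions and are not part of `cord19kg`; the formulas are the standard definitions of pointwise mutual information (PMI) and its normalized version (NPMI), computed here from paper-level counts.

```python
import math


def pmi(n_xy, n_x, n_y, n_total):
    """PMI of two entities from counts of papers mentioning each and both.

    Assumes n_xy > 0: entities that never co-occur get no edge at all.
    """
    p_xy = n_xy / n_total
    p_x = n_x / n_total
    p_y = n_y / n_total
    return math.log2(p_xy / (p_x * p_y))


def npmi(n_xy, n_x, n_y, n_total):
    """PMI divided by -log2 p(x, y), which bounds the score to [-1, 1]."""
    p_xy = n_xy / n_total
    if p_xy == 1.0:
        # Both entities occur in every paper: treat as perfect co-occurrence.
        return 1.0
    return pmi(n_xy, n_x, n_y, n_total) / -math.log2(p_xy)


# Hypothetical counts: 3000 papers, entity x in 300, entity y in 150, both in 60
print(pmi(60, 300, 150, 3000))
print(npmi(60, 300, 150, 3000))
```

A positive score means the two entities co-occur more often than expected if they were independent; NPMI makes scores comparable across entity pairs with very different frequencies.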

In [14]:
import json
import zipfile

import pandas as pd

import dash_cytoscape as cyto

from cord19kg.utils import (generate_curation_table,
                           link_ontology,
                           generate_cooccurrence_analysis)
from cord19kg.apps.curation_app import curation_app
from cord19kg.apps.visualization_app import visualization_app

In [15]:
cyto.load_extra_layouts()

## 1. Data preparation

The input dataset contains occurrences of different terms in paragraphs of scientific articles from the CORD-19 dataset previously extracted by means of a NER model.

In [22]:
%%time
print("Decompressing the input data file...")
with zipfile.ZipFile("BlueGraph/cord19kg/examples/data/Glucose_risk_3000_papers.csv.zip", 'r') as zip_ref:
    zip_ref.extractall("BlueGraph/cord19kg/examples/data/")
data = pd.read_csv("BlueGraph/cord19kg/examples/data/Glucose_risk_3000_papers.csv")
print("Done.")

Decompressing the input data file...


FileNotFoundError: ignored

In [20]:
data.sample(5)

Unnamed: 0,entity,entity_type,occurrence
1599944,ventricular,ORGAN,19229:Abstract:486
763765,spinal cord,CELL_TYPE,9907:Stem Cell Research And Injury Repairs Hum...
245072,IRIS,DISEASE,5680:S419:452
1546658,Neisseria gonorrhoeae,ORGANISM,18225:P667 Comparison Of The New Versant® Ct/G...
1731187,SAM,CHEMICAL,21542:P2031 In Vitro Activity Of Tigecycline A...


In the first preparation step, we group and aggregate the input data by unique entities.

In [21]:
%%time
print("Preparing curation data...")
curation_input_table, factor_counts = generate_curation_table(data)
print("Done.")

Preparing curation data...
Cleaning up the entities...
Aggregating occurrences of entities....
Done.
CPU times: user 2min 42s, sys: 7.68 s, total: 2min 49s
Wall time: 2min 44s


The resulting dataframe contains a row per unique named entity together with the following occurrence data: 
- sets of unique paragraphs, papers, sections, where the corresponding entity is mentioned (`paper`, `section`, `paragraph` columns);
- number of total entity occurrences (the `raw_frequency` column);
- number of unique papers where it occurs (the `paper_frequency` column);
- unique entity types assigned by the NER model (the `entity_type` column, multiple types are possible);
- raw entity types assigned by the NER model with the multiplicity of their occurrence (the `raw_entity_types` column).
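
To make the aggregation concrete, here is a minimal pure-Python sketch of how such a table could be derived from raw mentions. It is not the implementation of `generate_curation_table`: the helper name `aggregate_mentions`, the toy mentions, the lower-casing of entities and the assumption that every occurrence has the form `paper:section:paragraph` are ours.

```python
from collections import defaultdict

# Toy NER mentions as (entity, entity_type, occurrence) tuples (hypothetical data)
mentions = [
    ("Glucose", "CHEMICAL", "1:Abstract:0"),
    ("glucose", "CHEMICAL", "1:Results:4"),
    ("ACE2", "PROTEIN", "1:Results:4"),
    ("glucose", "DRUG", "2:Abstract:1"),
]


def aggregate_mentions(mentions):
    """Group mentions by lower-cased entity and collect occurrence statistics."""
    table = defaultdict(lambda: {
        "paper": set(), "section": set(), "paragraph": set(),
        "raw_entity_types": [],
    })
    for entity, entity_type, occurrence in mentions:
        row = table[entity.lower()]
        row["paragraph"].add(occurrence)
        # Assumed format: everything before the last ':' identifies the section,
        # everything before the first ':' identifies the paper.
        row["section"].add(occurrence.rsplit(":", 1)[0])
        row["paper"].add(occurrence.split(":", 1)[0])
        row["raw_entity_types"].append(entity_type)
    for row in table.values():
        row["entity_type"] = set(row["raw_entity_types"])
        row["raw_frequency"] = len(row["raw_entity_types"])
        row["paper_frequency"] = len(row["paper"])
    return dict(table)


aggregated = aggregate_mentions(mentions)
print(aggregated["glucose"])
```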


In [23]:
curation_input_table.sample(5)

Unnamed: 0,entity,entity_type,paragraph,paper,section,paper_frequency,raw_entity_types,raw_frequency
80408,superoxide anion,CHEMICAL,"[2087:Discussion:17, 21665:B2.32:262, 166624:A...","[218972, 8327, 9769, 16201, 212440, 21665, 216...","[8327:Introduction, 228969:Myocardial Damage, ...",36,"[CHEMICAL, CHEMICAL, CHEMICAL, CHEMICAL, CHEMI...",73
32071,furunculosis,DISEASE,"[13187:351:394, 14321:P1586 Chronic Furunculos...","[18225, 13187, 14321, 21542, 8225]",[8225:O152 Integrated Analysis Of Efficacy Of ...,5,"[DISEASE, DISEASE, DISEASE, DISEASE, DISEASE, ...",11
8311,auto-inflammatory diseases,DISEASE,"[21363:Conclusion::1487, 6116:Sp-24:51, 3583:A...","[6116, 21363, 3583]","[3583:A77, 21363:Conclusion, 6116:Sp-24]",3,"[DISEASE, DISEASE, DISEASE, DISEASE]",4
16685,cfbe41o-cell line,CELL_TYPE,[21896:4ଙ Wild-Type Cftr Enhances The Barrier ...,"[21896, 9769, 21948]",[21896:4ଙ Wild-Type Cftr Enhances The Barrier ...,3,"[CELL_TYPE, CELL_TYPE, CELL_TYPE]",3
61884,oxygenated hemoglobin,CHEMICAL,"[21876:Results::239, 7094:J849:269]","[21876, 7094]","[21876:Results, 7094:J849]",2,"[CHEMICAL, CHEMICAL]",2


The second output of the data preparation step contains the counts of distinct occurrence factors, i.e. the number of distinct papers, sections and paragraphs in the input corpus.

In [24]:
factor_counts

{'paper': 3000, 'paragraph': 211380, 'section': 53947}

## 2. Data curation

### Loading the NCIT ontology linking data

To group synonymous entities in the previously extracted table (e.g. `ace2`, `ace-2`, `angiotensin-converting enzyme 2`), as well as to assign additional semantics to these entities (e.g. human-readable definition, taxonomy, etc), we perform further _linking_ of the entities to the terms from the [NCIT ontology](https://ncithesaurus.nci.nih.gov/ncitbrowser/).

To perform this ontology linking, we load additional data that was pre-computed using ML-based linking models.

In [28]:
%%time
print("Loading the ontology linking data...")
    
print("\tDecompressing the input data file...")
with zipfile.ZipFile("BlueGraph/cord19kg/examples/data/NCIT_ontology_linking_3000_papers.csv.zip", 'r') as zip_ref:
    zip_ref.extractall("BlueGraph/cord19kg/examples/data/")

print("\tLoading the linking dataframe in memory...")
ontology_linking = pd.read_csv("BlueGraph/cord19kg/examples/data/NCIT_ontology_linking_3000_papers.csv")

print("\tLoading ontology type mapping...")
with open("BlueGraph/cord19kg/examples/data/NCIT_type_mapping.json", "rb") as f:
    type_mapping = json.load(f)
print("Done.")

Loading the ontology linking data...
	Decompressing the input data file...
	Loading the linking dataframe in memory...
	Loading ontology type mapping...
Done.
CPU times: user 1.56 s, sys: 186 ms, total: 1.75 s
Wall time: 1.83 s


The ontology linking table contains the following columns:
- `mention`: entity mentioned in the text
- `concept`: ontology concept linked to the entity mention
- `uid`: unique identifier of the ontology concept
- `definition`: definition of the concept
- `taxonomy`: taxonomy of semantic types associated with the concept
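
The effect of linking on synonymous mentions can be sketched as follows. This is only an illustration under our own assumptions, not the logic of `link_ontology`: the helper `merge_by_concept`, the toy paper sets and the toy mention-to-concept mapping are hypothetical.

```python
# Papers in which each raw mention occurs (hypothetical toy data)
occurrences = {
    "ace2": {"1", "2"},
    "ace-2": {"2", "3"},
    "angiotensin-converting enzyme 2": {"4"},
    "glucose": {"1"},
}

# Toy `mention` -> `concept` lookup, as could be built from the linking table
linking = {
    "ace2": "angiotensin-converting enzyme 2",
    "ace-2": "angiotensin-converting enzyme 2",
    "angiotensin-converting enzyme 2": "angiotensin-converting enzyme 2",
}


def merge_by_concept(occurrences, linking):
    """Merge the occurrence sets of all mentions linked to the same concept."""
    merged = {}
    for mention, papers in occurrences.items():
        # Mentions without a link are kept under their own name.
        concept = linking.get(mention, mention)
        entry = merged.setdefault(
            concept, {"paper": set(), "aggregated_entities": []})
        entry["paper"] |= papers
        entry["aggregated_entities"].append(mention)
    return merged


merged = merge_by_concept(occurrences, linking)
for concept, entry in merged.items():
    print(concept, sorted(entry["paper"]), entry["aggregated_entities"])
```

Keeping the list of raw mentions per concept (here `aggregated_entities`) makes it possible to inspect later which mentions were merged.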

In [29]:
ontology_linking.sample(5)

Unnamed: 0,mention,concept,uid,definition,taxonomy
87885,dialysate leakage,dialysis fluid,http://purl.obolibrary.org/obo/NCIT_C106515,The fluid which runs through a dialysis machin...,[('http://purl.obolibrary.org/obo/NCIT_C70699'...
32247,carbacholine,carbachol,http://purl.obolibrary.org/obo/NCIT_C47430,A synthetic choline ester and a positively cha...,[('http://purl.obolibrary.org/obo/NCIT_C29705'...
51630,human epithelial cancer,ovarian carcinoma,http://purl.obolibrary.org/obo/NCIT_C4908,A malignant neoplasm originating from the surf...,[('http://purl.obolibrary.org/obo/NCIT_C40026'...
57235,galactocerebroside,galactocerebrosidase,http://purl.obolibrary.org/obo/NCIT_C121620,"Galactocerebrosidase (685 aa, ~77 kDa) is enco...",[('http://purl.obolibrary.org/obo/NCIT_C16701'...
115855,bacterial transcription factor,transcription factor,http://purl.obolibrary.org/obo/NCIT_C17207,Transcription factors are a diverse group of p...,[('http://purl.obolibrary.org/obo/NCIT_C26199'...


### Interactive curation of entity occurrence data

The package provides an interactive entity curation app that allows the user to visualize the entity occurrence data, modify it, perform ontology linking (see the `Link to NCIT ontology` button) and filter out short or infrequent entities.

The field `Keep` allows specifying a set of entities that must be kept in the dataset at all times (even if they don't satisfy the selected filtering criteria).

Finally, the value specified in the `Generate Graphs from top 500 frequent entities` field corresponds to the number of top entities (by the frequency of their occurrence in papers) to be included in the co-occurrence network.
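
The interplay between the top-N selection and the `Keep` field can be sketched in a few lines. The helper `select_entities` and the toy frequencies are hypothetical, and we assume here that kept entities are added on top of the N most frequent ones; the app may count them differently.

```python
def select_entities(paper_frequency, top_n, keep=()):
    """Pick the `top_n` entities by paper frequency, then add the kept ones."""
    ranked = sorted(paper_frequency, key=paper_frequency.get, reverse=True)
    selected = ranked[:top_n]
    # Kept entities survive even if they are too rare to make the cut.
    selected += [e for e in keep if e in paper_frequency and e not in selected]
    return selected


# Hypothetical paper frequencies
frequencies = {"glucose": 50, "insulin": 40, "ace2": 30, "covid-19": 5}
selected = select_entities(frequencies, top_n=2, keep=["glucose", "covid-19"])
print(selected)
```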

We load the prepared data table into the curation app as follows:

In [30]:
curation_app.set_table(curation_input_table.copy())

We can specify the default entities to keep.

In [31]:
default_entities_to_keep = ["glucose", "covid-19"]
curation_app.set_default_terms_to_include(default_entities_to_keep)

Finally, we set the ontology linking callback to be fired upon a click on the `Link to NCIT ontology` button.

In [32]:
curation_app.set_ontology_linking_callback(lambda x: link_ontology(ontology_linking, type_mapping, x))

### Launch the curation app

The application can be launched inline (inside the current notebook), as shown below.

In [33]:
curation_app.run(port=8073, mode="inline")

<IPython.core.display.Javascript object>

Alternatively, it can be opened externally in a separate browser tab: uncomment and execute the cell below, then Ctrl+Click on the displayed URL.

In [34]:
# curation_app.run(port=8070, mode="external")

## 3. Co-occurrence network generation

The curation table currently displayed in the curation app can be extracted using the `get_curated_table` method.

In [35]:
curated_occurrence_data = curation_app.get_curated_table()
curated_occurrence_data.sample(3)

Unnamed: 0_level_0,paper,section,paragraph,aggregated_entities,uid,definition,paper_frequency,entity_type
entity,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
systemic hyperinflammation,"{175292, 169902, 160268, 185208, 172250}","{185208:Introduction, 172250:Abstract, 160268:...","{175292:Abstract:1, 160268:Discussion:25, 1852...",[systemic hyperinflammation],,,5,DISEASE
calcium overload,"{9805, 7793, 5949, 13186, 30520, 21768, 56, 64...",{5167:Effects Of Human Growth Hormone Treatmen...,{56:Early Recognition Of An Uncommon Pathogen:...,[calcium overload],,,11,DISEASE
thymopentin,"{17184, 21876, 3872, 7151, 228798}","{17184:Glucagon-Like Peptide-1 , 17184:Oxyntom...","{228798:Dpp4, Appetite, Energy Expenditure, An...",[oxyntomodulin],http://purl.obolibrary.org/obo/NCIT_C1294,A synthetic pentapeptide which is the active s...,5,DRUG


Before we can proceed, we need to convert the values of the `paper`, `section` and `paragraph` columns into sets.

In [36]:
curated_occurrence_data["paper"] = curated_occurrence_data["paper"].apply(set)
curated_occurrence_data["paragraph"] = curated_occurrence_data["paragraph"].apply(set)
curated_occurrence_data["section"] = curated_occurrence_data["section"].apply(set)

We can also retrieve the current values of the `Keep` field (these entities will also be included in the resulting co-occurrence network).

In [37]:
curation_app.get_terms_to_include()

['glucose', 'covid-19']

### Generating co-occurrence networks

In the cell below we generate a paragraph-based entity co-occurrence network. Along with the network generation, the `generate_cooccurrence_analysis` function:

- computes node centrality metrics (such as degree, PageRank)
- computes co-occurrence statistics (such as frequency, pointwise mutual information and normalized pointwise mutual information) and assigns them as weights to the corresponding edges
- performs entity community detection based on different co-occurrence statistics
- computes mutual-information-based minimum spanning trees.
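
The core idea behind the network generation can be sketched without any graph library: two entities are connected if their occurrence sets intersect, and the size of the intersection gives the co-occurrence frequency. The snippet below is a simplified illustration under our own assumptions (the helper names and the toy paragraph sets are hypothetical); it does not reproduce the library function, which additionally computes the mutual-information weights, communities and spanning trees listed above.

```python
from itertools import combinations


def cooccurrence_edges(occurrence_sets):
    """Map each co-occurring entity pair to the number of shared occurrences."""
    edges = {}
    for a, b in combinations(sorted(occurrence_sets), 2):
        common = occurrence_sets[a] & occurrence_sets[b]
        if common:
            edges[(a, b)] = len(common)
    return edges


def degrees(occurrence_sets, edges):
    """Count the neighbours of every entity (unweighted degree)."""
    degree = {entity: 0 for entity in occurrence_sets}
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return degree


# Hypothetical paragraph sets per entity
paragraphs = {
    "glucose": {"1:Abstract:0", "1:Results:4", "2:Abstract:1"},
    "insulin": {"1:Results:4", "2:Abstract:1"},
    "ace2": {"1:Results:4"},
    "fever": {"3:Introduction:0"},
}

edges = cooccurrence_edges(paragraphs)
print(edges)
print(degrees(paragraphs, edges))
```

Comparing all pairs is quadratic in the number of entities, which is one reason to restrict the network to the most frequent entities.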

Here we restrict the network to the 200 most frequent entities.

Before we run the co-occurrence analysis, we create a dictionary with backend configurations for the analytics: we set metrics (centralities) computation to use `graph_tool`, community detection to use `networkx` and, finally, path search to use `graph_tool` as well.

In [38]:
import time

In [39]:
backend_configs = {
    "metrics": "graph_tool",
    "communities": "networkx",
    "paths": "graph_tool"
}

In [45]:
%%time
type_data = curated_occurrence_data[["entity_type"]].rename(columns={"entity_type": "type"})

graphs, trees = generate_cooccurrence_analysis(
    curated_occurrence_data,  factor_counts,
    n_most_frequent=200,
    type_data=type_data, 
    factors=["paragraph"],
    keep=curation_app.get_terms_to_include(),
    cores=24,  # set the number of cores here
    backend_configs=backend_configs)
print("Done.")

-------------------------------
Factor: paragraph
-------------------------------
Done.
CPU times: user 4.74 s, sys: 1.69 s, total: 6.43 s
Wall time: 34.7 s


## 4. Network visualization and analysis

### Loading the generated graph into the visualization app

First, we set a backend for the visualization app. Two backends are currently supported, one based on `NetworkX` and one based on `graph-tool`; in this example we use the latter.

In [46]:
visualization_app.set_backend("graph_tool")

In [47]:
# Run the following to use NetworkX as the backend for the visualization app
# visualization_app.set_backend("networkx")

In [75]:
visualization_app.add_graph(
    "Paragraph-based graph", graphs["paragraph"],
    tree=trees["paragraph"], default_top_n=100)

visualization_app.set_current_graph("Paragraph-based graph")

KeyError: ignored

### Loading papers' meta-data into the app

We now load an additional dataset containing some meta-data on the papers where the entities analyzed in this notebook occur.

In [50]:
paper_data = pd.read_csv("BlueGraph/cord19kg/examples/data/Glucose_risk_3000_paper_meta_data.csv")
paper_data = paper_data.set_index("id")
paper_data.head(3)

Unnamed: 0_level_0,title,authors,abstract,doi,url,journal,pmc_id,pubmed_id,publish_time
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
3,Surfactant protein-D and pulmonary host defense,"Crouch, Erika C",Surfactant protein-D (SP-D) participates in th...,10.1186/rr19,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5...,Respir Res,PMC59549,11667972.0,2000-08-25
56,CLINICAL VIGNETTES,,,10.1046/j.1525-1497.18.s1.20.x,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1...,J Gen Intern Med,PMC1494988,12753119.0,2003-04-01
58,Clinical Vignettes,,,10.1046/j.1525-1497.2001.0160s1023.x,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1...,J Gen Intern Med,PMC1495316,11357836.0,2001-04-01


We pass a callback for the lookup of paper meta-data to the visualization app using the `set_list_papers_callback` method.

In [None]:
def list_papers(paper_data, selected_papers, limit=200):
    """Return meta-data records for at most `limit` of the selected papers."""
    selected_paper_data = paper_data.loc[[int(p) for p in selected_papers]].head(limit)
    return selected_paper_data.to_dict("records")

visualization_app.set_list_papers_callback(lambda x: list_papers(paper_data, x))

The ontology linking process described above is noisy; therefore, we would like to keep the possibility of accessing the raw entities that were linked to particular ontology concepts. For this, we define the function `get_aggregated_entities` that retrieves such raw entities and pass it to the visualization app using the `set_aggregated_entities_callback` method.

In [None]:
def top_n(data_dict, n, smallest=False):
    """Return top `n` keys of the input dictionary by their value."""
    df = pd.DataFrame(dict(data_dict).items(), columns=["id", "value"])
    if smallest:
        df = df.nsmallest(n, columns=["value"])
    else:
        df = df.nlargest(n, columns=["value"])
    return list(df["id"])


def get_aggregated_entities(entity, n):
    """Return the `n` most frequent raw entities aggregated under `entity` with their frequencies."""
    if "aggregated_entities" in curated_occurrence_data.columns:
        aggregated = curated_occurrence_data.loc[entity]["aggregated_entities"]
    else:
        aggregated = [entity]
    if curation_input_table is not None:
        df = curation_input_table.set_index("entity")
        if entity in curated_occurrence_data.index:
            freqs = df.loc[aggregated]["paper_frequency"].to_dict()
        else:
            return {}
    else:
        df = data.copy()
        df["entity"] = data["entity"].apply(lambda x: x.lower())
        freqs = df[df["entity"].apply(lambda x: x.lower() in aggregated)].groupby("entity").aggregate(
            lambda x: len(x))["entity_type"].to_dict()
    if len(freqs) == 0:
        return {}
    return {e: freqs[e] for e in top_n(freqs, n)}

visualization_app.set_aggregated_entities_callback(
    lambda x: get_aggregated_entities(x, 10))

Finally, we create a dictionary `definitions` that will serve the visualization app as the lookup table for accessing the definitions of different ontology concepts.

In [None]:
definitions = ontology_linking[["concept", "definition"]].groupby(
    "concept").aggregate(lambda x: list(x)[0]).to_dict()["definition"]
visualization_app.set_entity_definitons(definitions)

### Launching the visualization app

As before, the interactive graph visualization app can be launched in two modes: inline and external. Here we recommend the external mode for a better user experience.

In [None]:
visualization_app.run(port=8081, mode="external")