If you want to run this notebook in Google Colab, run the following cell. Otherwise follow [installation instructions](https://github.com/BlueBrain/BlueGraph/blob/master/README.rst#installation) to install BlueGraph and its dependencies locally.

In [None]:
# Install bluegraph
! git clone https://github.com/BlueBrain/BlueGraph
! cd BlueGraph && pip install .[cord19kg]

# Install graph-tool
!echo "deb http://downloads.skewed.de/apt bionic main" >> /etc/apt/sources.list
!apt-key adv --keyserver keys.openpgp.org --recv-key 612DEFB798507F25
!apt-get update
!apt-get install python3-graph-tool python3-cairo python3-matplotlib

DATA_PATH = "BlueGraph/cord19kg/examples/data/"

# Topic-centered co-occurrence network analysis of CORD-19

In this notebook we will perform interactive exploration and analysis of a topic-centered subset of the [CORD-19](https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge) dataset using the `cord19kg` package. The exploration and analysis techniques presented here focus on named entities and their co-occurrence in the scientific articles constituting the dataset.

The input data for this notebook contains the named entities extracted from the 3000 articles most relevant to the query _"Glucose is a risk factor of COVID-19"_, obtained using the article search model [BlueSearch](https://github.com/BlueBrain/Search). The entity extraction was performed using the Named Entity Recognition (NER) techniques also included in [BlueSearch](https://github.com/BlueBrain/Search). The entities belong to 10 different types (i.e. proteins, chemicals, drugs, diseases, conditions, organs, organisms, pathways, cell types, cell compartments).

The interactive literature exploration through the named entity co-occurrence analysis consists of the following steps:

1. __Data preparation__ step converts raw mentions into aggregated entity occurrence statistics.
2. __Data curation__ step allows the user to manage the extracted entities: modify them, filter them and link them to the ontology.
3. __Network generation__ step allows creating entity co-occurrence networks based on paper-, section- and paragraph-level co-occurrence relations between entities. These entity relations are quantified using mutual-information-based scores (pointwise mutual information and its normalized version).
4. __Network visualization and analysis__ step allows the user to visualize the network interactively, edit its elements and analyze it (spanning tree, mutual-information-based shortest paths between entities, etc.).

In [1]:
import json
import os
import zipfile

import pandas as pd

import dash_cytoscape as cyto

from cord19kg.utils import (generate_curation_table,
                           link_ontology,
                           generate_cooccurrence_analysis)
from cord19kg.apps.curation_app import curation_app
from cord19kg.apps.visualization_app import visualization_app

In [2]:
try:
    print(f"Data path: '{DATA_PATH}'")
except NameError:
    DATA_PATH = "../data/"
    print(f"Data path: '{DATA_PATH}'")

Data path: '../data/'


In [3]:
cyto.load_extra_layouts()

## 1. Data preparation

The input dataset contains occurrences of different terms in paragraphs of scientific articles from the CORD-19 dataset previously extracted by means of a NER model.

In [4]:
%%time
print("Decompressing the input data file...")
with zipfile.ZipFile(os.path.join(DATA_PATH, "Glucose_risk_3000_papers.csv.zip"), 'r') as zip_ref:
    zip_ref.extractall(DATA_PATH)
data = pd.read_csv(os.path.join(DATA_PATH, "Glucose_risk_3000_papers.csv"))
print("Done.")

Decompressing the input data file...
Done.
CPU times: user 2.51 s, sys: 290 ms, total: 2.8 s
Wall time: 2.82 s


In [5]:
data.sample(5)

Unnamed: 0,entity,entity_type,occurrence
525699,pneumonias,DISEASE,7093:Results:217
112688,cancer,DISEASE,2548:V18-4-Jd:90
710214,CFTR,PROTEIN,9769:5W Differences In The Responses Of Human ...
928438,failure,DISEASE,13233:Discussion:270
274119,Staphylococcus capitis,DISEASE,5963:-Poster Presentation:70


In the first preparation step, we group and aggregate the input data by unique entities.

In [6]:
%%time
print("Preparing curation data...")
curation_input_table, factor_counts = generate_curation_table(data)
print("Done.")

Preparing curation data...
Cleaning up the entities...
Aggregating occurrences of entities....
Done.
CPU times: user 1min 22s, sys: 1.45 s, total: 1min 24s
Wall time: 1min 24s


The resulting dataframe contains one row per unique named entity together with the following occurrence data:
- sets of unique papers, sections and paragraphs where the corresponding entity is mentioned (the `paper`, `section`, `paragraph` columns);
- total number of entity occurrences (the `raw_frequency` column);
- number of unique papers where it occurs (the `paper_frequency` column);
- unique entity types assigned by the NER model (the `entity_type` column, multiple types are possible);
- raw entity types assigned by the NER model with the multiplicity of their occurrence (the `raw_entity_types` column).
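To make the shape of this aggregation concrete, here is a minimal pure-Python sketch on toy data. It is not the implementation of `generate_curation_table`: the helper names `split_occurrence` and `aggregate_mentions` are hypothetical, and we assume that entities are lower-cased and that each occurrence has the form `paper:section:paragraph`, as the sampled rows suggest.

```python
from collections import defaultdict


def split_occurrence(occurrence):
    """Split 'paper:section:paragraph' into paper, section and paragraph keys."""
    paper, rest = occurrence.split(":", 1)   # the paper id precedes the first colon
    section, _ = rest.rsplit(":", 1)         # the paragraph index follows the last colon
    return paper, f"{paper}:{section}", occurrence


def aggregate_mentions(mentions):
    """Group (entity, entity_type, occurrence) triples by lower-cased entity."""
    table = defaultdict(lambda: {
        "paper": set(), "section": set(), "paragraph": set(),
        "raw_entity_types": [], "raw_frequency": 0,
    })
    for entity, entity_type, occurrence in mentions:
        paper, section, paragraph = split_occurrence(occurrence)
        row = table[entity.lower()]
        row["paper"].add(paper)
        row["section"].add(section)
        row["paragraph"].add(paragraph)
        row["raw_entity_types"].append(entity_type)
        row["raw_frequency"] += 1
    for row in table.values():
        # Derived columns: number of distinct papers and the unique types
        row["paper_frequency"] = len(row["paper"])
        row["entity_type"] = ", ".join(sorted(set(row["raw_entity_types"])))
    return dict(table)


# Toy mentions (made up for illustration, not taken from the dataset)
toy_mentions = [
    ("Glucose", "CHEMICAL", "1:Results:10"),
    ("glucose", "CHEMICAL", "1:Results:11"),
    ("glucose", "DRUG", "2:T1:Ps.153:4"),
    ("ACE2", "PROTEIN", "1:Results:10"),
]
toy_table = aggregate_mentions(toy_mentions)
print(toy_table["glucose"]["paper_frequency"], toy_table["glucose"]["raw_frequency"])
# prints: 2 3
```

Note that section names may themselves contain colons (as in `T1:Ps.153`), which is why the sketch splits on the first and the last colon only.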


In [7]:
curation_input_table.sample(5)

Unnamed: 0,entity,entity_type,paragraph,paper,section,paper_frequency,raw_entity_types,raw_frequency
85039,typhi,"DISEASE, ORGANISM","[22186:Conclusion:603, 214458:Sirs:18, 13279:C...","[22186, 18225, 214458, 13279]",[18225:P1482 Bloodstream Infections Due To Gra...,4,"[ORGANISM, DISEASE, ORGANISM, ORGANISM, ORGANISM]",5
82463,thoracic abnormalities,DISEASE,[1029:C. Common Chest And Abdominal Problems A...,"[7045, 1029]","[7045:A-208 09, 1029:C. Common Chest And Abdom...",2,"[DISEASE, DISEASE, DISEASE]",3
76516,serum igf-system components,PROTEIN,"[5113:Abstract:128, 5113:Abstract:129]",[5113],[5113:Abstract],1,"[PROTEIN, PROTEIN]",2
31300,fluvastatin,DRUG,[22196:In Vitro Effect Of Statins On Intracell...,"[6116, 13187, 21542, 165978, 22196, 13599, 182...","[21656:P-03.03.2-009, 28686:Intrapartum Care G...",11,"[DRUG, DRUG, DRUG, DRUG, DRUG, DRUG, DRUG, DRU...",16
78071,socioeconomic factors,PROTEIN,"[7151:T1:Ps.153:104, 171873:P1220 National Mrs...","[28060, 5856, 204542, 14091, 200371, 215624, 7...","[184544:Discussion, 209939:6. Discussion, 343:...",48,"[PROTEIN, PROTEIN, PROTEIN, PROTEIN, PROTEIN, ...",69


The second output of the data preparation step contains the counts of distinct instances of the occurrence factors, i.e. the number of distinct papers, sections and paragraphs in the input corpus.

In [8]:
factor_counts

{'paper': 3000, 'section': 53947, 'paragraph': 211380}

## 2. Data curation

### Loading the NCIT ontology linking data

To group synonymous entities in the previously extracted table (e.g. `ace2`, `ace-2`, `angiotensin-converting enzyme 2`), as well as to assign additional semantics to these entities (e.g. human-readable definition, taxonomy, etc.), we perform further _linking_ of the entities to the terms from the [NCIT ontology](https://ncithesaurus.nci.nih.gov/ncitbrowser/).

To be able to perform such ontology linking, we load some additional (pre-computed using ML-based linking models) data.

In [9]:
%%time
print("Loading the ontology linking data...")
    
print("\tDecompressing the input data file...")
with zipfile.ZipFile(os.path.join(DATA_PATH, "NCIT_ontology_linking_3000_papers.csv.zip"), 'r') as zip_ref:
    zip_ref.extractall(DATA_PATH)

print("\tLoading the linking dataframe in memory...")
ontology_linking = pd.read_csv(os.path.join(DATA_PATH, "NCIT_ontology_linking_3000_papers.csv"))

print("\tLoading ontology type mapping...")
with open(os.path.join(DATA_PATH, "NCIT_type_mapping.json"), "rb") as f:
    type_mapping = json.load(f)
print("Done.")

Loading the ontology linking data...
	Decompressing the input data file...
	Loading the linking dataframe in memory...
	Loading ontology type mapping...
Done.
CPU times: user 1.23 s, sys: 185 ms, total: 1.41 s
Wall time: 1.43 s


The ontology linking table contains the following columns:
- `mention` entity mentioned in the text
- `concept` ontology concept linked to the entity mention
- `uid` unique identifier of the ontology concept
- `definition` definition of the concept
- `taxonomy` taxonomy of semantic types associated with the concept
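The effect of linking on the occurrence data can be illustrated with a small sketch: mentions that map to the same concept are merged into one row whose paper set is the union of the individual sets. This is only an illustration under our own assumptions, not the logic of `link_ontology`; the helper `merge_by_concept` and the toy mention-to-concept mapping are hypothetical.

```python
def merge_by_concept(entity_papers, mention_to_concept):
    """Merge the paper sets of mentions that link to the same concept."""
    merged = {}
    for mention, papers in entity_papers.items():
        # Mentions without a linked concept keep their own name
        concept = mention_to_concept.get(mention, mention)
        row = merged.setdefault(concept, {"paper": set(), "aggregated_entities": []})
        row["paper"] |= papers
        row["aggregated_entities"].append(mention)
    return merged


# Toy data (made up for illustration)
toy_entity_papers = {
    "ace2": {1, 2},
    "ace-2": {2, 3},
    "angiotensin-converting enzyme 2": {4},
    "ticam2": {5},
}
toy_linking = {
    "ace2": "angiotensin-converting enzyme 2",
    "ace-2": "angiotensin-converting enzyme 2",
    "angiotensin-converting enzyme 2": "angiotensin-converting enzyme 2",
}
toy_merged = merge_by_concept(toy_entity_papers, toy_linking)
print(sorted(toy_merged["angiotensin-converting enzyme 2"]["paper"]))
# prints: [1, 2, 3, 4]
```

Keeping the list of merged mentions (here `aggregated_entities`) makes it possible to inspect later which raw entities ended up under a given concept.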

In [10]:
ontology_linking.sample(5)

Unnamed: 0,mention,concept,uid,definition,taxonomy
24522,meaib,mea,,,
4594,carbamazepine,carbamazepine,http://purl.obolibrary.org/obo/NCIT_C341,A tricyclic compound chemically related to tri...,"[('http://purl.obolibrary.org/obo/NCIT_C264', ..."
84348,epithelial malignant mesotheliomas,epithelioid mesothelioma,http://purl.obolibrary.org/obo/NCIT_C7985,A malignant neoplasm arising from mesothelial ...,"[('http://purl.obolibrary.org/obo/NCIT_C8420',..."
134393,h. punctata,punctate,http://purl.obolibrary.org/obo/NCIT_C113776,Having tiny spots or depressions.,[('http://purl.obolibrary.org/obo/NCIT_C13442'...
136919,ticam2,ticam2,,,


### Interactive curation of entity occurrence data

The package provides an interactive entity curation app that allows the user to visualize the entity occurrence data, modify it, perform ontology linking (see the `Link to NCIT ontology` button) and filter out short or infrequent entities.

The field `Keep` allows specifying a set of entities that must be kept in the dataset at all times (even if they don't satisfy the selected filtering criteria).

Finally, the value specified in the `Generate Graphs from top 500 frequent entities` field sets the number of top entities (by the frequency of their occurrence in papers) to be included in the co-occurrence network.
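The interplay between the top-n cut-off and the `Keep` field can be sketched as follows. This is a hedged illustration rather than the app's actual code: the function `select_entities` is hypothetical, and we assume that kept entities are added on top of the n most frequent ones.

```python
def select_entities(paper_frequency, top_n, keep=()):
    """Return the `top_n` most frequent entities plus every entity in `keep`."""
    ranked = sorted(paper_frequency, key=paper_frequency.get, reverse=True)
    selected = ranked[:top_n]
    # Entities to keep are appended even when they fall outside the top n
    selected += [e for e in keep if e in paper_frequency and e not in selected]
    return selected


# Toy paper frequencies (made up for illustration)
toy_frequency = {
    "diabetes": 120, "infection": 300, "glucose": 45, "covid-19": 80, "ace2": 60,
}
print(select_entities(toy_frequency, 2, keep=["glucose", "covid-19"]))
# prints: ['infection', 'diabetes', 'glucose', 'covid-19']
```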

We load the prepared data table into the curation app as follows:

In [11]:
curation_app.set_table(curation_input_table.copy())

We can specify the default entities to keep.

In [12]:
default_entities_to_keep = ["glucose", "covid-19"]
curation_app.set_default_terms_to_include(default_entities_to_keep)

Finally, we set the ontology linking callback to be fired upon a click on the `Link to NCIT ontology` button.

In [13]:
curation_app.set_ontology_linking_callback(lambda x: link_ontology(ontology_linking, type_mapping, x))

### Launch the curation app

The application can be launched either inline (inside the current notebook) as below.

In [14]:
curation_app.run(port=8073, mode="inline")

Opening port number 8073 failed: Address 'http://127.0.0.1:8073' already in use.
    Try passing a different port to run_server.. Trying port number 8074 ...


Alternatively, it can be opened externally at a URL that you can open in a separate browser tab (try uncommenting and executing the cell below, then Ctrl+Click the displayed URL).

In [15]:
# curation_app.run(port=8070, mode="external")

Merging the occurrence data with the ontology linking...


## 3. Co-occurrence network generation

The curation table currently displayed in the curation app can be extracted using the `get_curated_table` method.

In [16]:
curated_occurrence_data = curation_app.get_curated_table()
curated_occurrence_data.sample(3)

entity,paper,section,paragraph,aggregated_entities,uid,definition,paper_frequency,entity_type
tra,"{21655, 21806, 209939, 5680, 8327, 18225, 7095}","{209939:2. Theoretical Framework, 18225:P917 M...","{21806:Abstract:664, 18225:P917 Molecular Char...","[tcra, tcrab, tra, traj]",,,7,PROTEIN
orchiectomy,"{5949, 7094, 13250, 14095, 21665}","{14095:F-268, 7094:J335, 13250:E28, 21665:B2.8...","{5949:Objectives.:366, 21665:B2.86:278, 14095:...",[castration],http://purl.obolibrary.org/obo/NCIT_C15288,Surgical removal of one or both testicles.,5,PATHWAY
antibacterial agent,"{210565, 21655, 217999, 227856, 21652, 228773,...","{21655:Pp8-11, 14321:P793 Moxifloxacin Inhibit...","{21673:P19.55:504, 21542:P1340 Comparative Res...","[antibacterial, antibacterial drugs, antibacte...",http://purl.obolibrary.org/obo/NCIT_C52588,A family of substances capable of destroying o...,29,DRUG


Before we can proceed, we need to convert the values of the `paper`, `section` and `paragraph` columns into Python `set` objects.

In [17]:
curated_occurrence_data["paper"] = curated_occurrence_data["paper"].apply(set)
curated_occurrence_data["paragraph"] = curated_occurrence_data["paragraph"].apply(set)
curated_occurrence_data["section"] = curated_occurrence_data["section"].apply(set)
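Sets are convenient here because the co-occurrence frequency of two entities is simply the size of the intersection of their occurrence sets. The following sketch shows the idea on toy data; the helper `cooccurrence_edges` is hypothetical and is not how the package builds its networks internally.

```python
from itertools import combinations


def cooccurrence_edges(occurrence_sets):
    """Yield (entity_a, entity_b, frequency) for entities sharing at least one item."""
    for (a, set_a), (b, set_b) in combinations(sorted(occurrence_sets.items()), 2):
        shared = set_a & set_b   # paragraphs (or papers, sections) mentioning both
        if shared:
            yield a, b, len(shared)


# Toy paragraph sets (made up for illustration)
toy_paragraphs = {
    "glucose": {"1:Results:10", "1:Results:11", "2:Methods:4"},
    "covid-19": {"1:Results:10", "2:Methods:4"},
    "ace2": {"3:Abstract:1"},
}
print(list(cooccurrence_edges(toy_paragraphs)))
# prints: [('covid-19', 'glucose', 2)]
```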

We can also retrieve the current values of the `Keep` field (these entities will also be included in the resulting co-occurrence network).

In [18]:
curation_app.get_terms_to_include()

['glucose', 'covid-19']

### Generating co-occurrence networks

In the cell below we generate a paragraph-based entity co-occurrence network. Along with the network generation, the `generate_cooccurrence_analysis` function:

- computes node centrality metrics (such as degree, PageRank)
- computes co-occurrence statistics (such as frequency, pointwise mutual information and normalized pointwise mutual information) and assigns them as weights to the corresponding edges
- performs entity community detection based on different co-occurrence statistics
- computes mutual-information-based minimum spanning trees.
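As a reminder of what the mutual-information-based edge weights measure, here is a minimal sketch of pointwise mutual information (PMI) and its normalized version (NPMI) computed from occurrence counts. The function `pmi_scores` is hypothetical and the natural logarithm is an assumption (the package may use a different base, which changes PMI by a constant factor but leaves NPMI unchanged).

```python
import math


def pmi_scores(count_a, count_b, count_ab, total):
    """Return (PMI, NPMI) from occurrence counts over `total` papers/sections/paragraphs."""
    if count_ab == 0:
        # Entities that never co-occur: PMI diverges, NPMI takes its minimum
        return float("-inf"), -1.0
    p_a, p_b, p_ab = count_a / total, count_b / total, count_ab / total
    # PMI = log( p(a, b) / (p(a) * p(b)) )
    pmi = math.log(p_ab / (p_a * p_b))
    # NPMI = PMI / -log p(a, b), which rescales PMI to the range [-1, 1]
    npmi = pmi / -math.log(p_ab) if p_ab < 1 else 1.0
    return pmi, npmi


# Toy counts: two entities found in 30 and 20 papers, 10 of them shared, out of 3000
pmi, npmi = pmi_scores(count_a=30, count_b=20, count_ab=10, total=3000)
print(round(pmi, 3), round(npmi, 3))
```

Here `total` plays the role of the corresponding entry of `factor_counts`, e.g. the number of distinct paragraphs for a paragraph-based network.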

Here we include the 1500 most frequent entities.

Before we run the co-occurrence analysis, we create a dictionary with backend configurations for the analytics: we set the computation of metrics (centralities) to use `graph_tool`, community detection to use `networkx` and, finally, path search to use `graph_tool` as well.

In [19]:
import time

In [20]:
backend_configs = {
    "metrics": "graph_tool",
    "communities": "networkx",
    "paths": "graph_tool"
}

In [22]:
top_n_entities_to_include = 1500  
# note that if you run this notebook in Colab, you may want to set a lower number
# of entities to include, in order to avoid long generation time

In [None]:
%%time
type_data = curated_occurrence_data[["entity_type"]].rename(columns={"entity_type": "type"})

graphs, trees = generate_cooccurrence_analysis(
    curated_occurrence_data,  factor_counts,
    n_most_frequent=top_n_entities_to_include,
    type_data=type_data, 
    factors=["paragraph"],
    keep=curation_app.get_terms_to_include(),
    cores=8,  # set the number of cores to use here
    backend_configs=backend_configs)
print("Done.")

-------------------------------
Factor: paragraph
-------------------------------
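To give an intuition for the spanning trees returned in `trees`, here is a small sketch of a minimum spanning tree computed with Kruskal's algorithm. It is not the package's implementation (which delegates to the configured graph backend); the function `minimum_spanning_tree`, the toy scores and the choice of `1 / NPMI` as the edge distance are assumptions made for illustration.

```python
def minimum_spanning_tree(nodes, weighted_edges):
    """Kruskal's algorithm: return the tree edges for (node_a, node_b, distance) triples."""
    parent = {node: node for node in nodes}

    def find(node):
        # Follow parent links to the component root, compressing the path
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    tree = []
    for a, b, distance in sorted(weighted_edges, key=lambda e: e[2]):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:   # the edge joins two separate components
            parent[root_a] = root_b
            tree.append((a, b, distance))
    return tree


# Toy NPMI scores (made up); strongly associated entities get short distances
toy_npmi = {
    ("glucose", "insulin"): 0.8,
    ("glucose", "diabetes"): 0.5,
    ("insulin", "diabetes"): 0.4,
    ("diabetes", "covid-19"): 0.2,
}
toy_edges = [(a, b, 1 / score) for (a, b), score in toy_npmi.items()]
toy_nodes = {node for pair in toy_npmi for node in pair}
toy_tree = minimum_spanning_tree(toy_nodes, toy_edges)
print([(a, b) for a, b, _ in toy_tree])
# prints: [('glucose', 'insulin'), ('glucose', 'diabetes'), ('diabetes', 'covid-19')]
```

The tree keeps only the strongest associations needed to connect all entities, which is why it is a useful default view for dense co-occurrence networks.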


## 4. Network visualization and analysis

### Loading the generated graph into the visualization app

First, we set the backend for the visualization app. Two backends are currently supported, based on `NetworkX` and `graph-tool`; in this example we use the latter.

In [31]:
visualization_app.set_backend("graph_tool")

In [32]:
# Run the following to use NetworkX as the backend for the visualization app
# visualization_app.set_backend("networkx")

In [33]:
visualization_app.add_graph(
    "Paragraph-based graph", graphs["paragraph"],
    tree=trees["paragraph"], default_top_n=100)

visualization_app.set_current_graph("Paragraph-based graph")

KeyError: '@id'

### Loading papers' meta-data into the app

We now load an additional dataset containing some meta-data on the papers where the entities analyzed in this notebook occur.

In [None]:
paper_data = pd.read_csv(os.path.join(DATA_PATH, "Glucose_risk_3000_paper_meta_data.csv"))
paper_data = paper_data.set_index("id")
paper_data.head(3)

We pass a callback for the lookup of paper meta-data to the visualization app using the `set_list_papers_callback` method.

In [None]:
def list_papers(paper_data, selected_papers, limit=200):
    """Return meta-data records for at most `limit` of the selected papers."""
    selected_paper_data = paper_data.loc[[int(p) for p in selected_papers]].head(limit)
    return selected_paper_data.to_dict("records")

visualization_app.set_list_papers_callback(lambda x: list_papers(paper_data, x))

The ontology linking process described above is noisy; therefore, we would like to keep the possibility of accessing the raw entities that were linked to particular ontology concepts. For this, we define the function `get_aggregated_entities` that retrieves such raw entities, and we pass it to the visualization app using the `set_aggregated_entities_callback` method.

In [None]:
def top_n(data_dict, n, smallest=False):
    """Return top `n` keys of the input dictionary by their value."""
    df = pd.DataFrame(dict(data_dict).items(), columns=["id", "value"])
    if smallest:
        df = df.nsmallest(n, columns=["value"])
    else:
        df = df.nlargest(n, columns=["value"])
    return list(df["id"])


def get_aggregated_entities(entity, n):
    """Return the `n` most frequent raw entities aggregated into `entity`."""
    # Raw mentions merged into this entity during the ontology linking
    if "aggregated_entities" in curated_occurrence_data.columns:
        aggregated = curated_occurrence_data.loc[entity]["aggregated_entities"]
    else:
        aggregated = [entity]
    if curation_input_table is not None:
        # Look up paper frequencies of the raw mentions in the curation input
        df = curation_input_table.set_index("entity")
        if entity in curated_occurrence_data.index:
            freqs = df.loc[aggregated]["paper_frequency"].to_dict()
        else:
            return {}
    else:
        # Fall back to counting raw mentions in the input data
        df = data.copy()
        df["entity"] = data["entity"].apply(lambda x: x.lower())
        freqs = df[df["entity"].apply(lambda x: x.lower() in aggregated)].groupby("entity").aggregate(
            lambda x: len(x))["entity_type"].to_dict()
    if len(freqs) == 0:
        return {}
    return {e: freqs[e] for e in top_n(freqs, n)}

visualization_app.set_aggregated_entities_callback(
    lambda x: get_aggregated_entities(x, 10))

Finally, we create a dictionary `definitions` that will serve the visualization app as the lookup table for accessing the definitions of different ontology concepts.

In [None]:
definitions = ontology_linking[["concept", "definition"]].groupby(
    "concept").aggregate(lambda x: list(x)[0]).to_dict()["definition"]
visualization_app.set_entity_definitons(definitions)

### Launching the visualization app

As before, the interactive graph visualization app can be launched in two modes: inline and external. Here we recommend the external mode for a better user experience.

In [None]:
visualization_app.run(port=8081, mode="external")