# Topic-centered co-occurrence network analysis of CORD-19

In this notebook we will perform interactive exploration and analysis of a topic-centered subset of the [CORD-19](https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge) dataset using the `cord19kg` package. The exploration and analysis techniques presented here focus on named entities and their co-occurrence in the scientific articles constituting the dataset.

The input data for this notebook contains the named entities extracted from the 3000 articles most relevant to the query _"Glucose is a risk factor of COVID-19"_, obtained using the article search model [ref to blue brain search](). The entity extraction was performed using the Named Entity Recognition (NER) techniques also included in [ref to blue brain search](). The entities belong to 10 different types (proteins, chemicals, drugs, diseases, conditions, organs, organisms, pathways, cell types and cell compartments).

The interactive literature exploration through named entity co-occurrence analysis consists of the following steps:

1. __Data preparation__ step converts raw mentions into aggregated entity occurrence statistics.
2. __Data curation__ step allows the user to manage the extracted entities: modify and filter them, and link them to the ontology.
3. __Network generation__ step creates entity co-occurrence networks based on paper-, section- and paragraph-level co-occurrence relations between entities. These relations are quantified using mutual-information-based scores (pointwise mutual information and its normalized version).
4. __Network visualization and analysis__ step allows the user to visualize the network interactively, edit its elements and analyze it (spanning trees, mutual-information-based shortest paths between entities, etc.).
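
For reference, the mutual-information-based scores mentioned in step 3 have the following standard definitions. Here $p(x)$ and $p(y)$ are the fractions of papers, sections or paragraphs that mention entities $x$ and $y$, and $p(x, y)$ is the fraction that mentions both. The logarithm base and any smoothing applied inside `cord19kg` are not stated in this notebook, so treat this as the textbook form rather than the exact implementation.

$$
\mathrm{pmi}(x, y) = \log \frac{p(x, y)}{p(x)\,p(y)}, \qquad
\mathrm{npmi}(x, y) = \frac{\mathrm{pmi}(x, y)}{-\log p(x, y)}
$$

NPMI equals $0$ when the two entities occur independently, reaches $1$ when they always occur together, and approaches $-1$ as their co-occurrence becomes vanishingly rare.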

In [7]:
import json
import zipfile

import pandas as pd

import dash_cytoscape as cyto

from cord19kg.utils import (generate_curation_table,
                           link_ontology,
                           generate_comention_analysis)
from cord19kg.apps.curation_app import curation_app
from cord19kg.apps.visualization_app import visualization_app

In [3]:
cyto.load_extra_layouts()

## 1. Data preparation

The input dataset contains occurrences of different terms in paragraphs of scientific articles from the CORD-19 dataset previously extracted by means of a NER model.

In [4]:
%%time
print("Decompressing the input data file...")
with zipfile.ZipFile("../data/Glucose_risk_3000_papers.csv.zip", 'r') as zip_ref:
    zip_ref.extractall("../data/")
data = pd.read_csv("../data/Glucose_risk_3000_papers.csv")
print("Done.")

Decompressing the input data file...
Done.
CPU times: user 2.2 s, sys: 188 ms, total: 2.39 s
Wall time: 2.4 s


In [5]:
data.sample(5)

Unnamed: 0,entity,entity_type,occurrence
13264,renal,ORGAN,56:Back Pain And Fever In:127
279411,warfarin,CHEMICAL,5999:Methods.:428
864375,survival,PATHWAY,13187:427:503
937741,cats,ORGANISM,13234:Nu05:493
1712315,fever,DISEASE,21542:P1457 Streptobacillus Moniliformis Endoc...


In the first preparation step, we group and aggregate the input data by unique entities.

In [6]:
%%time
print("Preparing curation data...")
curation_input_table, factor_counts = generate_curation_table(data)
print("Done.")

Preparing curation data...
Cleaning up the entities...
Aggregating occurrences of entities....
Done.
CPU times: user 1min 10s, sys: 1.04 s, total: 1min 11s
Wall time: 1min 11s


The resulting dataframe contains one row per unique named entity together with the following occurrence data:
- sets of unique paragraphs, papers and sections where the corresponding entity is mentioned (the `paragraph`, `paper` and `section` columns);
- total number of entity occurrences (the `raw_frequency` column);
- number of unique papers where it occurs (the `paper_frequency` column);
- unique entity types assigned by the NER model (the `entity_type` column, multiple types are possible);
- raw entity types assigned by the NER model with the multiplicity of their occurrence (the `raw_entity_types` column).
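
The aggregation described above can be sketched with plain pandas. This is a minimal illustration on hypothetical toy data, not the implementation of `generate_curation_table`; the helper name `aggregate_mentions` is made up, and the sketch assumes that each `occurrence` value has the form `paper:section:paragraph`, as the sample rows suggest.

```python
import pandas as pd

# Hypothetical toy mentions in the same layout as the input data.
mentions = pd.DataFrame({
    "entity": ["Glucose", "glucose", "ACE2", "glucose"],
    "entity_type": ["CHEMICAL", "CHEMICAL", "PROTEIN", "CHEMICAL"],
    "occurrence": ["1:Intro:0", "1:Methods:4", "2:Results:7", "2:Results:7"],
})


def aggregate_mentions(df):
    """Group raw mentions by lower-cased entity and collect occurrence sets."""
    df = df.copy()
    df["entity"] = df["entity"].str.lower()
    # Assumed layout: "paper:section:paragraph". The paper id precedes the
    # first colon and the section id is everything before the last colon.
    df["paragraph"] = df["occurrence"]
    df["section"] = df["occurrence"].str.rsplit(":", n=1).str[0]
    df["paper"] = df["occurrence"].str.split(":", n=1).str[0]
    table = df.groupby("entity").agg(
        entity_type=("entity_type", lambda x: ", ".join(sorted(set(x)))),
        paragraph=("paragraph", lambda x: sorted(set(x))),
        paper=("paper", lambda x: sorted(set(x))),
        section=("section", lambda x: sorted(set(x))),
        raw_entity_types=("entity_type", list),
        raw_frequency=("occurrence", "size"),
    )
    table["paper_frequency"] = table["paper"].apply(len)
    return table.reset_index()


table = aggregate_mentions(mentions)
print(table[["entity", "paper", "paper_frequency", "raw_frequency"]])
```

In this toy example `glucose` is mentioned three times (`raw_frequency`) in two distinct papers (`paper_frequency`).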


In [8]:
curation_input_table.sample(5)

Unnamed: 0,entity,entity_type,paragraph,paper,section,paper_frequency,raw_entity_types,raw_frequency
34949,gram-negative organisms,ORGANISM,"[13730:0651:270, 8352:P1568:410, 14321:P431 Na...","[182690, 5719, 18225, 16698, 194425, 229372, 7...","[13228:Grant Acknowledgment, 13182:P76, 13106:...",31,"[ORGANISM, ORGANISM, ORGANISM, ORGANISM, ORGAN...",64
46548,kin,CHEMICAL,"[189045:Exclusion Criteria:57, 13159:Results:3...","[9907, 13159, 189045]","[9907:Caption, 13159:Results, 189045:Exclusion...",3,"[CHEMICAL, CHEMICAL, CHEMICAL]",3
25796,e coli st131,DISEASE,[171873:P1676 Characterisation Of Resistance M...,[171873],[171873:P1676 Characterisation Of Resistance M...,1,"[DISEASE, DISEASE, DISEASE]",3
85175,ubiquitin-specific protease,PROTEIN,"[9769:47:70, 21655:Pp3B-2:190, 79291:Molecular...","[21655, 79291, 9769]","[79291:Molecular Enrichment, 9769:47, 21655:Pp...",3,"[PROTEIN, PROTEIN, PROTEIN]",3
20172,connexin,"PROTEIN, ORGANISM","[14095:F-124:451, 21768:Immune Checkpoint Inhi...","[14095, 21768, 14098]","[14098:Caption, 21768:Immune Checkpoint Inhibi...",3,"[PROTEIN, ORGANISM, PROTEIN]",3


The second output of the data preparation step contains the counts of the occurrence factors, i.e. the number of distinct papers, sections and paragraphs in the input corpus.

In [9]:
factor_counts

{'paper': 3000, 'section': 53947, 'paragraph': 211380}

## 2. Data curation

### Loading the NCIT ontology linking data

To group synonymous entities in the previously extracted table (e.g. `ace2`, `ace-2`, `angiotensin-converting enzyme 2`), as well as to assign additional semantics to these entities (e.g. a human-readable definition, taxonomy, etc.), we perform further _linking_ of the entities to the terms from the [NCIT ontology](https://ncithesaurus.nci.nih.gov/ncitbrowser/).

To be able to perform such ontology linking, we load some additional (pre-computed using ML-based linking models) data.

In [10]:
%%time
print("Loading the ontology linking data...")
    
print("\tDecompressing the input data file...")
with zipfile.ZipFile("../data/NCIT_ontology_linking_3000_papers.csv.zip", 'r') as zip_ref:
    zip_ref.extractall("../data/")

print("\tLoading the linking dataframe in memory...")
ontology_linking = pd.read_csv("../data/NCIT_ontology_linking_3000_papers.csv")

print("\tLoading ontology type mapping...")
with open("../data/NCIT_type_mapping.json", "rb") as f:
    type_mapping = json.load(f)
print("Done.")

Loading the ontology linking data...
	Decompressing the input data file...
	Loading the linking dataframe in memory...
	Loading ontology type mapping...
Done.
CPU times: user 1.01 s, sys: 113 ms, total: 1.12 s
Wall time: 1.13 s


The ontology linking table contains the following columns:
- `mention` entity mentioned in the text
- `concept` ontology concept linked to the entity mention
- `uid` unique identifier of the ontology concept
- `definition` definition of the concept
- `taxonomy` taxonomy of semantic types associated with the concept
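
To illustrate how a table with `mention` and `concept` columns can be used to group synonymous entities, here is a minimal pandas sketch on hypothetical toy data. It is not the implementation of `link_ontology`; the helper name `merge_synonyms` is made up, and the sketch assumes that entities without a linked concept simply keep their own name.

```python
import pandas as pd

# Hypothetical toy occurrence table: one row per raw entity.
occurrences = pd.DataFrame({
    "entity": ["ace2", "ace-2", "angiotensin-converting enzyme 2", "glucose"],
    "paper": [{"1", "2"}, {"2", "3"}, {"4"}, {"1"}],
})
# Hypothetical toy linking table: raw mention -> ontology concept.
linking = pd.DataFrame({
    "mention": ["ace2", "ace-2", "angiotensin-converting enzyme 2"],
    "concept": ["angiotensin-converting enzyme 2"] * 3,
})


def merge_synonyms(occurrences, linking):
    """Merge the rows of entities that are linked to the same concept."""
    merged = occurrences.merge(
        linking, left_on="entity", right_on="mention", how="left")
    # Assumption: unlinked entities are kept under their own name.
    merged["concept"] = merged["concept"].fillna(merged["entity"])
    grouped = merged.groupby("concept").agg(
        aggregated_entities=("entity", list),
        paper=("paper", lambda sets: set().union(*sets)),
    )
    # Frequencies must be recomputed from the merged sets, not summed,
    # because synonyms may occur in the same paper.
    grouped["paper_frequency"] = grouped["paper"].apply(len)
    return grouped


linked = merge_synonyms(occurrences, linking)
print(linked[["aggregated_entities", "paper_frequency"]])
```

Note that the three synonyms cover four distinct papers in total, although their individual paper counts add up to five.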

In [11]:
ontology_linking.sample(5)

Unnamed: 0,mention,concept,uid,definition,taxonomy
150330,a696glucocorticoid receptor,glucocorticoid receptor,http://purl.obolibrary.org/obo/NCIT_C17071,"Glucocorticoid receptor (777 aa, ~86 kDa) is e...",[('http://purl.obolibrary.org/obo/NCIT_C18108'...
102687,mitochondrial adenine nucleotide translocator,adp/atp translocase 1,http://purl.obolibrary.org/obo/NCIT_C37299,"ADP/ATP translocase 1 (298 aa, ~33 kDa) is enc...",[('http://purl.obolibrary.org/obo/NCIT_C37297'...
14680,strictured,stenosis,http://purl.obolibrary.org/obo/NCIT_C50754,"Narrowing or stricture of a vessel, duct or ca...",[('http://purl.obolibrary.org/obo/NCIT_C36295'...
156151,doif,dogri language,http://purl.obolibrary.org/obo/NCIT_C153902,An Indo-Aryan language spoken in India and Pak...,[('http://purl.obolibrary.org/obo/NCIT_C161844...
99683,cultured bronchial epithelial cells,ciliated bronchial epithelial cell,http://purl.obolibrary.org/obo/NCIT_C32317,A columnar-shaped cell found in the epithelium...,[('http://purl.obolibrary.org/obo/NCIT_C54242'...


### Interactive curation of entity occurrence data

The package provides an interactive entity curation app that allows the user to visualize the entity occurrence data, modify it, perform ontology linking (see the `Link to NCIT ontology` button) and filter out short or infrequent entities.

The field `Keep` allows specifying a set of entities that must be kept in the dataset at all times (even if they don't satisfy the selected filtering criteria).

Finally, the value specified in the `Generate Graphs from top 500 frequent entities` field corresponds to the number of top entities (by the frequency of their occurrence in papers) to be included in the co-occurrence network.
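
The interplay between the top-`n` selection and the `Keep` field can be sketched as follows. This is a hypothetical pure-Python illustration, not the app's code: the function name `select_entities` and the toy frequencies are made up, and the sketch assumes that kept entities are added on top of the `n` most frequent ones.

```python
def select_entities(paper_frequency, n, keep=()):
    """Return the `n` most frequent entities plus every entity in `keep`."""
    ranked = sorted(paper_frequency, key=paper_frequency.get, reverse=True)
    selected = ranked[:n]
    # Assumption: entities to keep are appended even if they fall outside
    # the top `n`; unknown names are ignored.
    selected += [e for e in keep if e in paper_frequency and e not in selected]
    return selected


# Hypothetical paper frequencies.
freqs = {"blood": 900, "human": 850, "lung": 700, "glucose": 120, "covid-19": 95}
selected = select_entities(freqs, n=3, keep=["glucose", "covid-19"])
print(selected)
```

Here `glucose` and `covid-19` survive the selection although they are not among the three most frequent entities.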

We load the prepared data table into the curation app as follows:

In [12]:
curation_app.set_table(curation_input_table.copy())

We can specify the default entities to keep.

In [13]:
default_entities_to_keep = ["glucose", "covid-19"]
curation_app.set_default_terms_to_include(default_entities_to_keep)

Finally, we set the ontology linking callback to be fired upon a click on the `Link to NCIT ontology` button.

In [14]:
curation_app.set_ontology_linking_callback(lambda x: link_ontology(ontology_linking, type_mapping, x))

### Launch the curation app

The application can be launched inline (inside the current notebook), as in the cell below.

In [16]:
curation_app.run(port=8072, mode="inline")

Merging the occurrence data with the ontology linking...


Alternatively, the application can be opened externally, at a URL that you can open in a separate browser tab. To try this, uncomment and execute the cell below, then Ctrl+Click the displayed URL.

In [None]:
# curation_app.run(port=8070, mode="external")

## 3. Co-occurrence network generation

The current curation table displayed in the curation app can be extracted using the `get_curated_table` method.

In [19]:
curated_occurrence_data = curation_app.get_curated_table()
curated_occurrence_data.sample(5)

Unnamed: 0_level_0,aggregated_entities,paragraph,paper,section,raw_entity_types,uid,definition,taxonomy,paper_frequency,entity_type,entity_type_label
entity,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
inguinal region,"[groin, groin pain, groins, inguinal, inguinal...",[18225:P845 Candida Colonization Among Paediat...,"[6402, 1407, 18225, 6457, 6689, 7105, 2548, 97...",[6719:An Unusual Presentation Of Recurrence Of...,"[ORGAN, ORGAN, ORGAN, ORGAN, ORGAN, ORGAN, ORG...",http://purl.obolibrary.org/obo/NCIT_C12726,The lower region of the anterior abdominal wal...,[('http://purl.obolibrary.org/obo/NCIT_C12680'...,31,ORGAN,ORGAN
pectoralis muscle,"[pectoral muscle, pectoral muscles, pectoralis...","[21942:11.:170, 7076:E-Ps:862, 13921:Figure 1....","[7076, 7075, 179328, 21942, 7045, 6209, 13921]","[7076:E-Ps, 21942:9., 21942:11., 6209:Caption,...","[ORGAN, ORGAN, ORGAN, ORGAN, ORGAN, ORGAN, ORG...",http://purl.obolibrary.org/obo/NCIT_C33286,Muscles of the upper chest. The term may refer...,[('http://purl.obolibrary.org/obo/NCIT_C13056'...,7,ORGAN,ORGAN
streptococcus canis,"[streptococcus canis, streptococcus dysgalactiae]","[13320:Caption:978, 171873:P772 Haemolytic Ura...","[13320, 8225, 6122, 18225, 171873, 21542, 22081]","[13320:Caption, 21542:P998 Antimicrobial Resis...","[ORGANISM, ORGANISM, ORGANISM, ORGANISM, ORGAN...",http://purl.obolibrary.org/obo/NCIT_C86786,"A species of facultatively anaerobic, Gram pos...",[('http://purl.obolibrary.org/obo/NCIT_C76383'...,7,ORGANISM,ORGANISM
eq-5d-3l - anxiety or depression,"[anxiety/ depression, anxiety/depression]","[13232:Competing Interests:20, 13228:Nonventil...","[222405, 58, 343, 21947, 13187, 13232, 13228, ...",[21947:The Effect Of Depression And Anxiety Up...,"[DISEASE, DISEASE, DISEASE, DISEASE, DISEASE, ...",http://purl.obolibrary.org/obo/NCIT_C100396,The EuroQol (European Quality of Life) Five Di...,[('http://purl.obolibrary.org/obo/NCIT_C100114...,12,DISEASE,DISEASE
aortic valve insufficiency,"[aortic insufficiency, aortic valve insufficie...","[58:Caption:702, 58:Is Patient Trust Of Physic...","[58, 7093, 1407, 7075, 493, 14065, 21846, 1922...","[493:P-308, 7093:Results, 1407:A-070 08, 7075:...","[DISEASE, DISEASE, DISEASE, DISEASE, DISEASE, ...",http://purl.obolibrary.org/obo/NCIT_C51223,Dysfunction of the aortic valve characterized ...,[('http://purl.obolibrary.org/obo/NCIT_C78650'...,10,DISEASE,DISEASE


Before we can proceed, we need to convert the values of the `paper`, `section` and `paragraph` columns into sets.

In [20]:
curated_occurrence_data["paper"] = curated_occurrence_data["paper"].apply(set)
curated_occurrence_data["paragraph"] = curated_occurrence_data["paragraph"].apply(set)
curated_occurrence_data["section"] = curated_occurrence_data["section"].apply(set)

We can also retrieve the current values of the `Keep` field (these entities will also be included in the resulting co-occurrence network).

In [21]:
curation_app.get_terms_to_include()

['glucose', 'covid-19']

### Generating co-occurrence networks

In the cell below we generate a paragraph-based entity co-occurrence network (the `factors` argument is set to `["paragraph"]`). Along with the network generation, the `generate_comention_analysis` function:

- computes node centrality metrics (such as degree and PageRank);
- computes co-occurrence statistics (such as frequency, pointwise mutual information and normalized pointwise mutual information) and assigns them as weights to the corresponding edges;
- performs entity community detection based on different co-occurrence statistics;
- computes mutual-information-based minimum spanning trees.
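
The edge statistics listed above can be illustrated with a short pure-Python sketch. It is not the code of `generate_comention_analysis`: the function name `cooccurrence_edges` and the toy data are hypothetical, and the choice of a base-2 logarithm is an assumption. Each entity is represented by the set of paragraphs that mention it, which is why the occurrence columns were converted to sets earlier.

```python
import math
from itertools import combinations


def cooccurrence_edges(occurrences, total):
    """Return {(a, b): stats} for every pair of entities that co-occur.

    `occurrences` maps an entity to the set of factor instances (e.g.
    paragraphs) mentioning it; `total` is the number of such instances
    in the whole corpus (cf. `factor_counts`).
    """
    edges = {}
    for a, b in combinations(sorted(occurrences), 2):
        common = occurrences[a] & occurrences[b]
        if not common:
            continue  # no co-occurrence, so no edge
        p_ab = len(common) / total
        p_a = len(occurrences[a]) / total
        p_b = len(occurrences[b]) / total
        pmi = math.log2(p_ab / (p_a * p_b))
        # NPMI is undefined for p_ab == 1; both entities then occur everywhere.
        npmi = pmi / -math.log2(p_ab) if p_ab < 1 else 1.0
        edges[(a, b)] = {"frequency": len(common), "pmi": pmi, "npmi": npmi}
    return edges


# Hypothetical toy corpus with 8 paragraphs in total.
paragraphs = {
    "glucose": {"p1", "p2", "p3"},
    "insulin": {"p2", "p3"},
    "lung": {"p4"},
}
edges = cooccurrence_edges(paragraphs, total=8)
for pair, stats in edges.items():
    print(pair, stats)
```

The loop visits every unordered pair of entities, so for 1500 entities it examines 1500 · 1499 / 2 = 1124250 pairs, which is the number reported in the log of the generation cell.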

Here we restrict the network to the 1500 most frequent entities.

In [37]:
%%time
type_data = curated_occurrence_data[["entity_type"]].rename(columns={"entity_type": "type"})

graphs, trees = generate_comention_analysis(
    curated_occurrence_data, factor_counts,
    n_most_frequent=1500,
    type_data=type_data,
    factors=["paragraph"],
    keep=curation_app.get_terms_to_include(),
    cores=8)  # set this to the number of available CPU cores
print("Done.")

-------------------------------
Factor: paragraph
-------------------------------
Fitering data.....
Selected 1500 most frequent terms
Examining 1124250 pairs of terms for co-occurrence...
Generated 693669 edges                    
Created a co-occurrence graph:
	number of nodes:  1500
	number of edges:  693669
Saving the edges...
Creating a graph object...

Computing degree centrality statistics....
Top n nodes by frequency:
	covid-19 (174251)
	blood (166103)
	human (157878)
	lung (138504)
	infectious disorder (138176)
	heart (127273)
	diabetes mellitus (114019)
	mouse (101697)
	inflammation (95174)
	liver (93130)

Computing PageRank centrality statistics....
Top n nodes by frequency:
	blood (0.01)
	covid-19 (0.01)
	human (0.01)
	infectious disorder (0.01)
	lung (0.01)
	heart (0.01)
	diabetes mellitus (0.01)
	mouse (0.01)
	liver (0.01)
	inflammation (0.01)

Using the 'frequency' weight...
Detecting communities...
Best network partition:
	 Number of communities: 6
	 Modularity: 0.17856
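
Besides the graphs, `generate_comention_analysis` returns mutual-information-based spanning trees (the `trees` value). The idea can be sketched with Kruskal's algorithm on hypothetical toy data. The sketch assumes that the distance of an edge is the inverse of its (positive) NPMI score, so that strongly associated entities are close to each other; the exact distance used by `cord19kg` is not stated in this notebook, and the function below is illustrative only.

```python
def minimum_spanning_tree(nodes, weighted_edges):
    """Kruskal's algorithm; `weighted_edges` holds (distance, a, b) tuples."""
    parent = {n: n for n in nodes}

    def find(n):
        # Follow parent links to the component root, shortening the path.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    tree = []
    for distance, a, b in sorted(weighted_edges):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:  # the edge joins two components, so no cycle
            parent[root_a] = root_b
            tree.append((a, b))
    return tree


# Hypothetical positive NPMI scores between four entities.
npmi = {
    ("glucose", "insulin"): 0.8,
    ("glucose", "diabetes mellitus"): 0.5,
    ("insulin", "diabetes mellitus"): 0.7,
    ("diabetes mellitus", "covid-19"): 0.2,
}
nodes = {n for pair in npmi for n in pair}
# Assumed distance: inverse NPMI (only meaningful for positive scores).
distances = [(1 / score, a, b) for (a, b), score in npmi.items()]
tree = minimum_spanning_tree(nodes, distances)
print(tree)
```

The resulting tree keeps the strongest associations that still connect all entities: the weaker `glucose`/`diabetes mellitus` edge is dropped because both entities are already connected through `insulin`.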

## 4. Network visualization and analysis

### Loading the generated graph into the visualization app

In [39]:
visualization_app.set_graph(
    "Paragraph-based graph", graphs["paragraph"],
    tree_object=trees["paragraph"], default_top_n=100)

visualization_app.set_current_graph("Paragraph-based graph")

### Loading papers' meta-data into the app

We now load an additional dataset containing some meta-data on the papers where the entities analyzed in this notebook occur.

In [40]:
paper_data = pd.read_csv("../data/Glucose_risk_3000_paper_meta_data.csv")
paper_data = paper_data.set_index("id")
paper_data.head(3)

Unnamed: 0_level_0,title,authors,abstract,doi,url,journal,pmc_id,pubmed_id,publish_time
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
3,Surfactant protein-D and pulmonary host defense,"Crouch, Erika C",Surfactant protein-D (SP-D) participates in th...,10.1186/rr19,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5...,Respir Res,PMC59549,11667972.0,2000-08-25
56,CLINICAL VIGNETTES,,,10.1046/j.1525-1497.18.s1.20.x,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1...,J Gen Intern Med,PMC1494988,12753119.0,2003-04-01
58,Clinical Vignettes,,,10.1046/j.1525-1497.2001.0160s1023.x,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1...,J Gen Intern Med,PMC1495316,11357836.0,2001-04-01


We pass a callback for the lookup of paper meta-data to the visualization app using the `set_list_papers_callback` method.

In [41]:
def list_papers(paper_data, selected_papers, limit=200):
    """Return meta-data records for at most `limit` of the selected papers."""
    selected_paper_data = paper_data.loc[[int(p) for p in selected_papers]].head(limit)
    return selected_paper_data.to_dict("records")

visualization_app.set_list_papers_callback(lambda x: list_papers(paper_data, x))

The ontology linking process described above is noisy. We would therefore like to keep access to the raw entities that were linked to particular ontology concepts. To do so, we define the function `get_aggregated_entities`, which retrieves such raw entities, and pass it to the visualization app using the `set_aggregated_entities_callback` method.

In [42]:
def top_n(data_dict, n, smallest=False):
    """Return top `n` keys of the input dictionary by their value."""
    df = pd.DataFrame(dict(data_dict).items(), columns=["id", "value"])
    if smallest:
        df = df.nsmallest(n, columns=["value"])
    else:
        df = df.nlargest(n, columns=["value"])
    return list(df["id"])


def get_aggregated_entities(entity, n):
    if "aggregated_entities" in curated_occurrence_data.columns:
        aggregated = curated_occurrence_data.loc[entity]["aggregated_entities"]
    else:
        aggregated = [entity]
    if curation_input_table is not None:
        df = curation_input_table.set_index("entity")
        if entity in curated_occurrence_data.index:
            freqs = df.loc[aggregated]["paper_frequency"].to_dict()
        else:
            return {}
    else:
        df = data.copy()
        df["entity"] = data["entity"].apply(lambda x: x.lower())
        freqs = df[df["entity"].apply(lambda x: x.lower() in aggregated)].groupby("entity").aggregate(
            lambda x: len(x))["entity_type"].to_dict()
    if len(freqs) == 0:
        return {}
    return {e: freqs[e] for e in top_n(freqs, n)}

visualization_app.set_aggregated_entities_callback(
    lambda x: get_aggregated_entities(x, 10))

Finally, we create a dictionary `definitions` that will serve the visualization app as the lookup table for accessing the definitions of different ontology concepts.

In [43]:
definitions = ontology_linking[["concept", "definition"]].groupby(
    "concept").aggregate(lambda x: list(x)[0]).to_dict()["definition"]
visualization_app.set_entity_definitons(definitions)

### Launching the visualization app

As before, the interactive graph visualization app can be launched in two modes: inline and external. Here we recommend the external mode for a better user experience.

In [44]:
visualization_app.run(port=8081, mode="external")

Dash app running on http://127.0.0.1:8081/
