# Topic-centered co-occurrence network analysis of CORD-19

In this notebook we will perform interactive exploration and analysis of a topic-centered subset of the [CORD-19](https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge) dataset using the `cord19kg` package. The exploration and analysis techniques presented here focus on named entities and their co-occurrence in the scientific articles constituting the dataset.

The input data for this notebook contains the named entities extracted from the 3000 articles most relevant to the query _"Glucose is a risk factor of COVID-19"_, obtained using the article search model [ref to blue brain search](). The entity extraction was performed using the Named Entity Recognition (NER) techniques also included in [ref to blue brain search](). The entities belong to 10 different types (proteins, chemicals, drugs, diseases, conditions, organs, organisms, pathways, cell types, cell compartments).

The interactive literature exploration through named entity co-occurrence analysis consists of the following steps:

1. __Data preparation__ step converts raw mentions into aggregated entity occurrence statistics.
2. __Data curation__ step allows the user to manage the extracted entities: modify them, filter them and link them to the ontology.
3. __Network generation__ step creates entity co-occurrence networks based on paper-, section- and paragraph-level co-occurrence relations between entities. These entity relations are quantified using mutual-information-based scores (pointwise mutual information and its normalized version).
4. __Network visualization and analysis__ step allows the user to visualize the network interactively, edit its elements and analyze it (spanning tree, mutual-information-based shortest paths between entities, etc).

In [1]:
import json
import zipfile

import pandas as pd

import dash_cytoscape as cyto

from cord19kg.utils import (generate_curation_table,
                           link_ontology,
                           generate_cooccurrence_analysis)
from cord19kg.apps.curation_app import curation_app
from cord19kg.apps.visualization_app import visualization_app

In [2]:
cyto.load_extra_layouts()

## 1. Data preparation

The input dataset contains occurrences of different terms in paragraphs of scientific articles from the CORD-19 dataset. These occurrences were previously extracted by means of a NER model.

In [3]:
%%time
print("Decompressing the input data file...")
with zipfile.ZipFile("../data/Glucose_risk_3000_papers.csv.zip", 'r') as zip_ref:
    zip_ref.extractall("../data/")
data = pd.read_csv("../data/Glucose_risk_3000_papers.csv")
print("Done.")

Decompressing the input data file...
Done.
CPU times: user 2.3 s, sys: 278 ms, total: 2.58 s
Wall time: 2.61 s


In [4]:
data.sample(5)

Unnamed: 0,entity,entity_type,occurrence
1200615,cancers,DISEASE,14046:P 588 Absence Of Braf Mutations In Hyali...
1972085,chronic feline idiopathic cystitis,DISEASE,21810:Nu:922
2391633,hypoxemia,DISEASE,171819:Rationale ::: O. Mechanical Ventilation...
2590446,genus coronavirus,ORGANISM,186907:1. Introduction:2
780955,Asthma,DISEASE,10338:Conclusions::161


In the first preparation step, we group and aggregate the input data by unique entities.

In [5]:
%%time
print("Preparing curation data...")
curation_input_table, factor_counts = generate_curation_table(data)
print("Done.")

Preparing curation data...
Cleaning up the entities...
Aggregating occurrences of entities....
Done.
CPU times: user 1min 13s, sys: 1.73 s, total: 1min 15s
Wall time: 1min 16s


The resulting dataframe contains a row per unique named entity together with the following occurrence data:
- sets of unique paragraphs, papers, sections, where the corresponding entity is mentioned (`paper`, `section`, `paragraph` columns);
- number of total entity occurrences (the `raw_frequency` column);
- number of unique papers where it occurs (the `paper_frequency` column);
- unique entity types assigned by the NER model (the `entity_type` column, multiple types are possible);
- raw entity types assigned by the NER model with the multiplicity of their occurrence (the `raw_entity_types` column).
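
The aggregation described above can be illustrated with a minimal pure-Python sketch. It is not the implementation of `generate_curation_table`: the helper `aggregate_mentions` and the toy `mentions` are hypothetical, and the sketch assumes that the paper identifier precedes the first colon of an occurrence string and that the paragraph number follows the last one.

```python
from collections import defaultdict

# Toy mentions in the (entity, entity_type, occurrence) layout of the input table.
mentions = [
    ("Glucose", "CHEMICAL", "1:Introduction:1"),
    ("glucose", "CHEMICAL", "1:Introduction:2"),
    ("glucose", "DRUG", "2:Methods:7"),
    ("ACE2", "PROTEIN", "2:Methods:7"),
]


def aggregate_mentions(rows):
    """Group raw mentions by lower-cased entity and collect occurrence statistics."""
    table = defaultdict(lambda: {
        "paper": set(), "section": set(), "paragraph": set(),
        "raw_entity_types": [], "raw_frequency": 0,
    })
    for entity, entity_type, occurrence in rows:
        record = table[entity.lower()]
        # Assumption: "<paper>:<section>:<paragraph number>" occurrence strings.
        record["paper"].add(occurrence.split(":", 1)[0])
        record["section"].add(occurrence.rsplit(":", 1)[0])
        record["paragraph"].add(occurrence)
        record["raw_entity_types"].append(entity_type)
        record["raw_frequency"] += 1
    for record in table.values():
        record["paper_frequency"] = len(record["paper"])
        record["entity_type"] = ", ".join(sorted(set(record["raw_entity_types"])))
    return dict(table)


curation_rows = aggregate_mentions(mentions)
print(curation_rows["glucose"]["paper_frequency"])  # 2
```

Lower-casing the entity before grouping is what merges `Glucose` and `glucose` into a single row in this sketch.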


In [6]:
curation_input_table.sample(5)

Unnamed: 0,entity,entity_type,paragraph,paper,section,paper_frequency,raw_entity_types,raw_frequency
63290,paraquat,"DRUG, CHEMICAL","[6453:037:25, 13279:Introduction::714, 14524:5...","[7094, 18225, 777, 6453, 21652, 13279, 5167, 1...",[18225:P747 Nacl And Clonazepam Induce Low-Lev...,9,"[CHEMICAL, DRUG, CHEMICAL, DRUG, CHEMICAL, DRU...",47
9401,baff-r,PROTEIN,"[14524:University Of Glasgow, Glasgow, United ...","[160632, 22196, 8327, 14042, 14524]","[14524:University Of Glasgow, Glasgow, United ...",5,"[PROTEIN, PROTEIN, PROTEIN, PROTEIN, PROTEIN, ...",27
53818,monoterpenoids,"DRUG, CHEMICAL",[16201:Fruit Nutrients ::: Nutritive/Medicinal...,"[178755, 7076, 16201, 205963]","[7076:Ps-01-059, 178755:1. Introduction, 16201...",4,"[CHEMICAL, DRUG, CHEMICAL, DRUG, CHEMICAL, DRU...",10
67936,pr 1,PROTEIN,"[21846:Caption:651, 21846:Caption:776]",[21846],[21846:Caption],1,"[PROTEIN, PROTEIN]",2
48664,location,PATHWAY,"[7076:Institute Of Pathology Skopje, Republic ...","[6719, 14045, 5719, 21806, 21625, 21948, 7076,...","[7076:Institute Of Pathology Skopje, Republic ...",9,"[PATHWAY, PATHWAY, PATHWAY, PATHWAY, PATHWAY, ...",11


The second output of the data preparation step contains the counts of different instances of occurrence factors: the number of distinct papers/sections/paragraphs in the input corpus.

In [7]:
factor_counts

{'paper': 3000, 'section': 53947, 'paragraph': 211380}

## 2. Data curation

### Loading the NCIT ontology linking data

To group synonymous entities in the previously extracted table (e.g. `ace2`, `ace-2`, `angiotensin-converting enzyme 2`), as well as to assign additional semantics to these entities (e.g. human-readable definition, taxonomy, etc), we perform further _linking_ of the entities to the terms from the [NCIT ontology](https://ncithesaurus.nci.nih.gov/ncitbrowser/).

To perform such ontology linking, we load additional data that was pre-computed using ML-based linking models.

In [13]:
%%time
print("Loading the ontology linking data...")
    
print("\tDecompressing the input data file...")
with zipfile.ZipFile("../data/NCIT_ontology_linking_3000_papers.csv.zip", 'r') as zip_ref:
    zip_ref.extractall("../data/")

print("\tLoading the linking dataframe in memory...")
ontology_linking = pd.read_csv("../data/NCIT_ontology_linking_3000_papers.csv")

print("\tLoading ontology type mapping...")
with open("../data/NCIT_type_mapping.json", "rb") as f:
    type_mapping = json.load(f)
print("Done.")

Loading the ontology linking data...
	Decompressing the input data file...
	Loading the linking dataframe in memory...
	Loading ontology type mapping...
Done.
CPU times: user 1 s, sys: 159 ms, total: 1.16 s
Wall time: 1.2 s


The ontology linking table contains the following columns:
- `mention`: entity mentioned in the text
- `concept`: ontology concept linked to the entity mention
- `uid`: unique identifier of the ontology concept
- `definition`: definition of the concept
- `taxonomy`: taxonomy of semantic types associated with the concept
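
To make the effect of linking concrete, here is a minimal sketch of how mentions that share a concept could be merged. It does not reproduce the internals of `link_ontology`: the function `link_mentions` and both toy dictionaries are hypothetical, and only the `paper` sets are merged for brevity.

```python
# Hypothetical miniature versions of the curation and linking tables.
paper_sets = {
    "ace2": {"1", "2"},
    "ace-2": {"2", "3"},
    "angiotensin-converting enzyme 2": {"4"},
    "glucose": {"1"},
}
mention_to_concept = {
    "ace2": "angiotensin-converting enzyme 2",
    "ace-2": "angiotensin-converting enzyme 2",
    "angiotensin-converting enzyme 2": "angiotensin-converting enzyme 2",
}


def link_mentions(paper_sets, mention_to_concept):
    """Merge the occurrence sets of mentions that share an ontology concept."""
    linked = {}
    for mention, papers in paper_sets.items():
        # Mentions without a linked concept are kept under their own name.
        concept = mention_to_concept.get(mention, mention)
        entry = linked.setdefault(
            concept, {"paper": set(), "aggregated_entities": []})
        entry["paper"] |= papers
        entry["aggregated_entities"].append(mention)
    for entry in linked.values():
        entry["paper_frequency"] = len(entry["paper"])
    return linked


linked = link_mentions(paper_sets, mention_to_concept)
print(linked["angiotensin-converting enzyme 2"]["paper_frequency"])  # 4
```

Keeping the list of merged mentions next to each concept makes it possible to inspect later which raw entities ended up under it.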

In [14]:
ontology_linking.sample(5)

Unnamed: 0,mention,concept,uid,definition,taxonomy
77116,hepatic cyst,hepatic cyst,http://purl.obolibrary.org/obo/NCIT_C3960,A cystic lesion located in the liver.,[('http://purl.obolibrary.org/obo/NCIT_C36279'...
91931,nuclear accumulation,accumulation,http://purl.obolibrary.org/obo/NCIT_C120860,A state characterized by the gradual increase ...,[('http://purl.obolibrary.org/obo/NCIT_C36295'...
59229,phocine herpes virus-1,herpesvirus,http://purl.obolibrary.org/obo/NCIT_C14217,A heterogeneous family of morphologically simi...,[('http://purl.obolibrary.org/obo/NCIT_C14348'...
77432,aggressive t-cell lymphomas,aggressive non-hodgkin lymphoma,http://purl.obolibrary.org/obo/NCIT_C9244,A non-Hodgkin lymphoma with an aggressive clin...,"[('http://purl.obolibrary.org/obo/NCIT_C7215',..."
42915,comel-netherton syndrome,netherton syndrome,http://purl.obolibrary.org/obo/NCIT_C84922,A rare autosomal recessive form of ichthyosis ...,[('http://purl.obolibrary.org/obo/NCIT_C28193'...


### Interactive curation of entity occurrence data

The package provides an interactive entity curation app that allows the user to visualize the entity occurrence data, modify it, perform ontology linking (see the `Link to NCIT ontology` button), and filter out short or infrequent entities.

The field `Keep` allows specifying a set of entities that must be kept in the dataset at all times (even if they don't satisfy the selected filtering criteria).

Finally, the value specified in the `Generate Graphs from top 500 frequent entities` field corresponds to the number of top entities (by the frequency of their occurrence in papers) to be included in the co-occurrence network.
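
The interplay between the top-N cutoff and the `Keep` field can be sketched as follows. This is an illustration under assumptions rather than the app's actual logic: `select_entities` is a hypothetical helper, and it assumes that kept terms are added on top of the N most frequent entities.

```python
def select_entities(paper_frequency, top_n, keep=()):
    """Pick the `top_n` entities by paper frequency, then add the `keep` terms."""
    ranked = sorted(paper_frequency, key=paper_frequency.get, reverse=True)
    selected = ranked[:top_n]
    # Terms listed in `keep` survive even when they fall outside the top n.
    selected += [e for e in keep if e in paper_frequency and e not in selected]
    return selected


frequencies = {"blood": 120, "lung": 95, "glucose": 40, "pr 1": 1}
print(select_entities(frequencies, top_n=2, keep=["glucose"]))
# ['blood', 'lung', 'glucose']
```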

We load the prepared data table into the curation app as follows:

In [15]:
curation_app.set_table(curation_input_table.copy())

We can specify the default entities to keep.

In [16]:
default_entities_to_keep = ["glucose", "covid-19"]
curation_app.set_default_terms_to_include(default_entities_to_keep)

Finally, we set the ontology linking callback to be fired upon a click on the `Link to NCIT ontology` button.

In [17]:
curation_app.set_ontology_linking_callback(lambda x: link_ontology(ontology_linking, type_mapping, x))

### Launch the curation app

The application can be launched inline (inside the current notebook), as below.

In [18]:
curation_app.run(port=8073, mode="inline")

Alternatively, it can be opened externally, at a URL that you can open in a separate browser tab (try uncommenting and executing the cell below, then Ctrl+Click on the displayed URL).

In [19]:
# curation_app.run(port=8070, mode="external")

## 3. Co-occurrence network generation

The curation table currently displayed in the curation app can be extracted using the `get_curated_table` method.

In [22]:
curated_occurrence_data = curation_app.get_curated_table()
curated_occurrence_data.sample(5)

entity,paper,section,paragraph,aggregated_entities,uid,definition,paper_frequency,entity_type
primitive disease,"{7101, 7067}","{7067:Ps, 7101:}","{7067:Ps:415, 7101::1}",[primitive disease],,,2,DISEASE
trpc family,"{14089, 14091}","{14091:193, 14089:Conclusions}","{14091:193:95, 14089:Conclusions::154}",[trpc family],,,2,PROTEIN
pip/taz,"{21934, 7107, 182688, 18225}",{18225:P1268 In Vitro Interactions Of Colistin...,{18225:P1268 In Vitro Interactions Of Colistin...,[pip/taz],,,4,PROTEIN
n-term,"{211696, 21652}","{21652:P.2.1-035, 211696:O-Glycan Lc-Ms Analys...",{211696:O-Glycan Lc-Ms Analysis ::: Methods::4...,[n-term],,,2,PROTEIN
transcutaneous electrical nerve stimulation,"{3000, 14098, 21842, 26813, 21944, 17184, 493,...","{21842:Evaluation Of Two, 14098:Caption, 14084...",{17184:Enteric Nervous System ::: Control Of B...,[ens],http://purl.obolibrary.org/obo/NCIT_C21032,A non-invasive form of electroanalgesia that u...,8,PROTEIN


Before we can proceed, we need to convert the values of the `paper`, `section` and `paragraph` columns into `set` objects.

In [23]:
curated_occurrence_data["paper"] = curated_occurrence_data["paper"].apply(set)
curated_occurrence_data["paragraph"] = curated_occurrence_data["paragraph"].apply(set)
curated_occurrence_data["section"] = curated_occurrence_data["section"].apply(set)

We can also retrieve the current values of the `Keep` field (these entities will also be included in the resulting co-occurrence network).

In [24]:
curation_app.get_terms_to_include()

['glucose', 'covid-19']

### Generating co-occurrence networks

In the cell below we generate a paragraph-based entity co-occurrence network. Along with the network generation, the `generate_cooccurrence_analysis` function:

- computes node centrality metrics (such as degree, PageRank)
- computes co-occurrence statistics (such as frequency, pointwise mutual information and normalized pointwise mutual information) and assigns them as weights to the corresponding edges
- performs entity community detection based on different co-occurrence statistics
- computes mutual-information-based minimum spanning trees.

Here we include the 1500 most frequent entities (the `n_most_frequent` parameter).

In [17]:
%%time
type_data = curated_occurrence_data[["entity_type"]].rename(columns={"entity_type": "type"})

graphs, trees = generate_cooccurrence_analysis(
    curated_occurrence_data, factor_counts,
    n_most_frequent=1500,
    type_data=type_data,
    factors=["paragraph"],
    keep=curation_app.get_terms_to_include(),
    cores=8)  # number of CPU cores to use
print("Done.")

-------------------------------
Factor: paragraph
-------------------------------
Fitering data.....
Selected 500 most frequent terms
Examining 124750 pairs of terms for co-occurrence...
Generated 114687 edges                    
Created a co-occurrence graph:
	number of nodes:  500
	number of edges:  114687
Saving the edges...
Creating a graph object...

Computing degree centrality statistics....
Top n nodes by frequency:
	covid-19 (145406)
	blood (122406)
	human (114167)
	infectious disorder (105703)
	lung (105662)
	heart (97281)
	diabetes mellitus (92386)
	sars-cov-2 (76233)
	mouse (73430)
	inflammation (72293)

Computing PageRank centrality statistics....
Top n nodes by frequency:
	covid-19 (0.02)
	blood (0.02)
	human (0.01)
	infectious disorder (0.01)
	lung (0.01)
	heart (0.01)
	diabetes mellitus (0.01)
	mouse (0.01)
	sars-cov-2 (0.01)
	inflammation (0.01)

Using the 'frequency' weight...
Detecting communities...
Best network partition:
	 Number of communities: 6
	 Modularity: 0.1
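
The mutual-information-based edge weights can be illustrated with the standard definitions of pointwise mutual information (PMI) and its normalized version (NPMI). The sketch below applies the textbook formulas and is not taken from the package code; the helper `pmi_scores`, the base-2 logarithm and the toy sets are assumptions made for the example.

```python
import math


def pmi_scores(occurrences_x, occurrences_y, total):
    """Return (frequency, PMI, NPMI) for two entities.

    `occurrences_x` and `occurrences_y` are the sets of papers, sections or
    paragraphs mentioning each entity; `total` is the number of such units
    in the corpus (an entry of `factor_counts`).
    """
    joint = len(occurrences_x & occurrences_y)
    if joint == 0:
        return 0, None, None  # the entities never co-occur: no edge
    p_x = len(occurrences_x) / total
    p_y = len(occurrences_y) / total
    p_xy = joint / total
    pmi = math.log2(p_xy / (p_x * p_y))
    # NPMI rescales PMI to [-1, 1]; it equals 1 when the entities always co-occur.
    npmi = pmi / -math.log2(p_xy) if p_xy < 1 else 1.0
    return joint, pmi, npmi


glucose = {"p1", "p2", "p3", "p4"}
insulin = {"p3", "p4"}
frequency, pmi, npmi = pmi_scores(glucose, insulin, total=8)
print(frequency, round(pmi, 3), round(npmi, 3))  # 2 1.0 0.5
```

Unlike the raw co-occurrence frequency, NPMI does not automatically favor entities that are simply mentioned everywhere, which makes it a natural weight for comparing pairs of entities with very different frequencies.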

## 4. Network visualization and analysis

### Loading the generated graph into the visualization app

In [20]:
visualization_app.set_graph(
    "Paragraph-based graph", graphs["paragraph"],
    tree_object=trees["paragraph"], default_top_n=100)

visualization_app.set_current_graph("Paragraph-based graph")
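
The `trees` object produced during network generation holds the mutual-information-based minimum spanning trees. As a rough illustration of the idea, the sketch below runs Kruskal's algorithm on a toy graph. It is not the package's implementation: the function `minimum_spanning_tree`, the NPMI values and the conversion `distance = 1 / npmi` are all assumptions made for the example.

```python
def minimum_spanning_tree(edges):
    """Kruskal's algorithm over (node_a, node_b, distance) triples."""
    parent = {}

    def find(node):
        # Walk up to the component root, halving the path on the way.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    tree = []
    for a, b, distance in sorted(edges, key=lambda edge: edge[2]):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:  # the edge joins two components, so keep it
            parent[root_a] = root_b
            tree.append((a, b, distance))
    return tree


# Hypothetical NPMI weights; distance = 1 / npmi is an assumed conversion.
npmi = {("glucose", "insulin"): 0.8, ("glucose", "diabetes"): 0.5,
        ("insulin", "diabetes"): 0.4, ("diabetes", "covid-19"): 0.25}
edges = [(a, b, 1 / w) for (a, b), w in npmi.items()]
tree = minimum_spanning_tree(edges)
print([(a, b) for a, b, _ in tree])
# [('glucose', 'insulin'), ('glucose', 'diabetes'), ('diabetes', 'covid-19')]
```

With such a distance, strongly associated entities are close to each other, so the tree keeps the most informative edge out of every cycle and gives a sparse backbone of the dense co-occurrence graph.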

### Loading papers' meta-data into the app

We now load an additional dataset containing some meta-data on the papers where the entities analyzed in this notebook occur.

In [21]:
paper_data = pd.read_csv("../data/Glucose_risk_3000_paper_meta_data.csv")
paper_data = paper_data.set_index("id")
paper_data.head(3)

id,title,authors,abstract,doi,url,journal,pmc_id,pubmed_id,publish_time
3,Surfactant protein-D and pulmonary host defense,"Crouch, Erika C",Surfactant protein-D (SP-D) participates in th...,10.1186/rr19,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5...,Respir Res,PMC59549,11667972.0,2000-08-25
56,CLINICAL VIGNETTES,,,10.1046/j.1525-1497.18.s1.20.x,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1...,J Gen Intern Med,PMC1494988,12753119.0,2003-04-01
58,Clinical Vignettes,,,10.1046/j.1525-1497.2001.0160s1023.x,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1...,J Gen Intern Med,PMC1495316,11357836.0,2001-04-01


We pass a callback for the lookup of paper meta-data to the visualization app using the `set_list_papers_callback` method.

In [22]:
def list_papers(paper_data, selected_papers, limit=200):
    """Return the meta-data records of at most `limit` selected papers."""
    selected_paper_data = paper_data.loc[[int(p) for p in selected_papers]].head(limit)
    return selected_paper_data.to_dict("records")

visualization_app.set_list_papers_callback(lambda x: list_papers(paper_data, x))

The ontology linking process described above is noisy; therefore, we would like to keep the possibility of accessing the raw entities that were linked to particular ontology concepts. For this, we define the function `get_aggregated_entities` that retrieves such raw entities, and we pass it to the visualization app using the `set_aggregated_entities_callback` method.

In [23]:
def top_n(data_dict, n, smallest=False):
    """Return top `n` keys of the input dictionary by their value."""
    df = pd.DataFrame(list(dict(data_dict).items()), columns=["id", "value"])
    if smallest:
        df = df.nsmallest(n, columns=["value"])
    else:
        df = df.nlargest(n, columns=["value"])
    return list(df["id"])


def get_aggregated_entities(entity, n):
    """Return the frequencies of the `n` most frequent raw entities aggregated into `entity`."""
    # Raw entities merged into the (possibly ontology-linked) entity
    if "aggregated_entities" in curated_occurrence_data.columns:
        aggregated = curated_occurrence_data.loc[entity]["aggregated_entities"]
    else:
        aggregated = [entity]
    if curation_input_table is not None:
        # Look up paper frequencies in the table built before curation
        df = curation_input_table.set_index("entity")
        if entity in curated_occurrence_data.index:
            freqs = df.loc[aggregated]["paper_frequency"].to_dict()
        else:
            return {}
    else:
        # Fall back to counting mentions in the raw input data
        df = data.copy()
        df["entity"] = data["entity"].apply(lambda x: x.lower())
        freqs = df[df["entity"].apply(lambda x: x.lower() in aggregated)].groupby("entity").aggregate(
            lambda x: len(x))["entity_type"].to_dict()
    if len(freqs) == 0:
        return {}
    return {e: freqs[e] for e in top_n(freqs, n)}

visualization_app.set_aggregated_entities_callback(
    lambda x: get_aggregated_entities(x, 10))

Finally, we create a dictionary `definitions` that will serve the visualization app as the lookup table for accessing the definitions of different ontology concepts.

In [24]:
definitions = ontology_linking[["concept", "definition"]].groupby(
    "concept").aggregate(lambda x: list(x)[0]).to_dict()["definition"]
visualization_app.set_entity_definitons(definitions)

### Launching the visualization app

As before, the interactive graph visualization app can be launched in two modes: inline and external. Here we recommend the external mode for a better user experience.

In [25]:
visualization_app.run(port=8081, mode="external")

Dash app running on http://127.0.0.1:8081/
