# Co-occurrence network analysis tutorial

In this notebook we will illustrate how interactive exploration and analysis of the [CORD-19](https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge) dataset can be performed using the `cord19kg` package. The exploration and analysis techniques presented here focus on named entities and their co-occurrence in the scientific articles constituting the dataset.

The input data for this notebook contains the named entities extracted from a small selection of 20 articles. The entities represent 10 different entity types (i.e. proteins, chemicals, drugs, diseases, conditions, organs, organisms, pathways, cell types, cell compartments). The article selection corresponds to the 20 most relevant articles for the query _"Glucose is a risk factor of COVID-19"_ obtained using the article search model [ref to blue brain search](). The entity extraction was performed using the Named Entity Recognition (NER) techniques also included in [ref to blue brain search]().

The `cord19kg` package provides a set of tools for interactive literature exploration through the named entity co-occurrence analysis consisting of the following steps:

1. __Data preparation__ step converts raw mentions into aggregated entity occurrence statistics.
2. __Data curation__ step allows the user to manage the extracted entities: modify them, filter them and link them to the ontology.
3. __Network generation__ step allows creating entity co-occurrence networks based on paper-, section- and paragraph-level co-occurrence relations between entities. These entity relations are quantified using mutual-information-based scores (pointwise mutual information and its normalized version).
4. __Network visualization and analysis__ step allows the user to visualize the networks interactively, edit network elements and analyse the networks (spanning trees, mutual-information-based shortest paths between entities, etc).

In [1]:
import json
import zipfile

import pandas as pd

import dash_cytoscape as cyto

from cord19kg.utils import (generate_curation_table,
                            link_ontology,
                            generate_cooccurrence_analysis)
from cord19kg.apps.curation_app import curation_app
from cord19kg.apps.visualization_app import visualization_app

The cell below loads additional graph layouts used in the graph visualization app.

In [2]:
cyto.load_extra_layouts()

## 1. Data preparation

The input dataset contains occurrences of different terms in paragraphs of scientific articles from the CORD-19 dataset previously extracted by means of a NER model.

In [3]:
data = pd.read_csv("../data/Glucose_risk_20_papers.csv")

In [4]:
data.sample(5)

Unnamed: 0,entity,entity_type,occurrence
2748,respiratory failure,DISEASE,214924:Introduction:6
3383,AMPK,PROTEIN,214924:The Interplay Between Covid-19 And Ampk...
2000,blood,ORGAN,197804:Discussion:39
1949,angiotensin-converting enzyme,PROTEIN,197804:Management Of Children And Young People...
840,COVID-19,DISEASE,184360:Diabetes As Promoter Of Severity And Mo...


In the first preparation step, we group and aggregate the input data by unique entities.

In [5]:
%%time
print("Preparing curation data...")
curation_input_table, factor_counts = generate_curation_table(data)
print("Done.")

Preparing curation data...
Cleaning up the entities...
Aggregating occurrences of entities....
Done.
CPU times: user 279 ms, sys: 8.35 ms, total: 287 ms
Wall time: 290 ms


The resulting dataframe contains one row per unique named entity together with the following occurrence data:
- sets of unique papers, sections and paragraphs where the entity is mentioned (the `paper`, `section` and `paragraph` columns);
- total number of entity occurrences (the `raw_frequency` column);
- number of unique papers where it occurs (the `paper_frequency` column);
- unique entity types assigned by the NER model (the `entity_type` column; multiple types are possible);
- raw entity types assigned by the NER model, repeated once per occurrence (the `raw_entity_types` column).
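
The aggregation described above can be sketched in plain Python. This is only an illustration and not the implementation of `generate_curation_table`: the helper `aggregate_mentions`, the toy `mentions` list, the lower-casing of entities and the assumed `<paper>:<section title>:<paragraph number>` layout of the occurrence identifiers are our assumptions, inferred from the sample rows shown in this notebook.

```python
from collections import defaultdict

# Hypothetical raw mentions mirroring the (entity, entity_type, occurrence)
# columns of the input table; the values are made up for illustration.
mentions = [
    ("ACE2", "PROTEIN", "214924:Introduction:6"),
    ("ace2", "PROTEIN", "214924:Introduction:6"),
    ("ace2", "PROTEIN", "197804:Discussion:39"),
    ("glucose", "CHEMICAL", "197804:Discussion:39"),
]


def aggregate_mentions(rows):
    """Group mentions by lower-cased entity and collect occurrence data."""
    table = defaultdict(lambda: {
        "paper": set(), "section": set(), "paragraph": set(),
        "raw_entity_types": [], "raw_frequency": 0,
    })
    for entity, entity_type, occurrence in rows:
        record = table[entity.lower()]
        # Assumed layout: "<paper>:<section title>:<paragraph number>".
        # Section titles may contain colons, so we split from both ends.
        record["paper"].add(occurrence.split(":", 1)[0])
        record["section"].add(occurrence.rsplit(":", 1)[0])
        record["paragraph"].add(occurrence)
        record["raw_entity_types"].append(entity_type)
        record["raw_frequency"] += 1
    for record in table.values():
        record["paper_frequency"] = len(record["paper"])
        record["entity_type"] = sorted(set(record["raw_entity_types"]))
    return dict(table)


aggregated = aggregate_mentions(mentions)
print(aggregated["ace2"]["paper_frequency"], aggregated["ace2"]["raw_frequency"])
```

In this toy input, the two spellings `ACE2` and `ace2` end up in a single row whose `raw_frequency` counts all three mentions, while `paper_frequency` counts only the two distinct papers.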


In [6]:
curation_input_table.sample(5)

Unnamed: 0,entity,entity_type,paragraph,paper,section,paper_frequency,raw_entity_types,raw_frequency
210,pulmonary,ORGAN,[184360:Rhace2 As Decoy Factor ::: Therapeutic...,"[179426, 211373, 214924, 184360]","[214924:Introduction, 214924:The Immune Respon...",4,"[ORGAN, ORGAN, ORGAN, ORGAN, ORGAN, ORGAN, ORG...",16
243,t cell,CELL_TYPE,[184360:Gliptins ::: Therapeutic Potential Of ...,[184360],[184360:Cardiovascular Effects Of Sdpp4 Upregu...,1,"[CELL_TYPE, CELL_TYPE, CELL_TYPE]",3
200,organs,ORGAN,[214924:Angiotensin-Converting Enzyme 2 Expres...,"[214924, 211125]","[211125:Study Design And Participants, 214924:...",2,"[ORGAN, ORGAN, ORGAN, ORGAN]",4
16,ang ii,PROTEIN,[184360:Rhace2 As Decoy Factor ::: Therapeutic...,[184360],"[184360:Aceis And Arbs , 184360:Ang-(1-7) Anal...",1,"[PROTEIN, PROTEIN, PROTEIN, PROTEIN, PROTEIN, ...",9
180,mice,ORGANISM,[179426:Role Of Ace/Arbs ::: Special Aspects O...,"[179426, 214924, 184360, 211373, 211125]","[184360:Aceis And Arbs , 184360:Mechanisms Of ...",5,"[ORGANISM, ORGANISM, ORGANISM, ORGANISM, ORGAN...",18


The second output of the data preparation step contains the counts of distinct occurrence factor instances, i.e. the number of distinct papers, sections and paragraphs in the input corpus.

In [7]:
factor_counts

{'paper': 20, 'section': 108, 'paragraph': 286}

## 2. Data curation

### Loading the NCIT ontology linking data

To group synonymous entities in the previously extracted table (e.g. `ace2`, `ace-2`, `angiotensin-converting enzyme 2`), as well as to assign additional semantics to these entities (e.g. human-readable definition, taxonomy, etc), we perform further _linking_ of the entities to the terms from the [NCIT ontology](https://ncithesaurus.nci.nih.gov/ncitbrowser/).

To perform this ontology linking, we load additional data pre-computed using ML-based linking models.

In [8]:
%%time
print("Loading the ontology linking data...")
    
print("\tDecompressing the input data file...")
with zipfile.ZipFile("../data/NCIT_ontology_linking_3000_papers.csv.zip", 'r') as zip_ref:
    zip_ref.extractall("../data/")

print("\tLoading the linking dataframe in memory...")
ontology_linking = pd.read_csv("../data/NCIT_ontology_linking_3000_papers.csv")

print("\tLoading ontology type mapping...")
with open("../data/NCIT_type_mapping.json", "rb") as f:
    type_mapping = json.load(f)
print("Done.")

Loading the ontology linking data...
	Decompressing the input data file...
	Loading the linking dataframe in memory...
	Loading ontology type mapping...
Done.
CPU times: user 1.14 s, sys: 203 ms, total: 1.35 s
Wall time: 1.37 s


The ontology linking table contains the following columns:
- `mention`: entity mentioned in the text
- `concept`: ontology concept linked to the entity mention
- `uid`: unique identifier of the ontology concept
- `definition`: definition of the concept
- `taxonomy`: taxonomy of semantic types associated with the concept

In [9]:
ontology_linking.sample(5)

Unnamed: 0,mention,concept,uid,definition,taxonomy
120640,leishmania tropica,leishmania tropica,http://purl.obolibrary.org/obo/NCIT_C123511,A species of parasitic trypanosomatid protozoa...,[('http://purl.obolibrary.org/obo/NCIT_C123421...
88864,microvascular iliac,microcirculatory bed,http://purl.obolibrary.org/obo/NCIT_C33109,A collection of the smallest blood vessels of ...,[('http://purl.obolibrary.org/obo/NCIT_C12679'...
8095,hippocampal ca1,hippocampus,http://purl.obolibrary.org/obo/NCIT_C12444,A curved gray matter structure of the temporal...,[('http://purl.obolibrary.org/obo/NCIT_C13031'...
38269,dopaminergic neuronal,dopamine hydrochloride,http://purl.obolibrary.org/obo/NCIT_C455,"The hydrochloride salt form of dopamine, a mon...",[('http://purl.obolibrary.org/obo/NCIT_C29709'...
133923,pharyngoconjunctival fever,pharyngoconjunctival fever,http://purl.obolibrary.org/obo/NCIT_C34924,"A condition characterized by fever, conjunctiv...","[('http://purl.obolibrary.org/obo/NCIT_C3439',..."
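
To make the effect of linking on the occurrence data more concrete, here is a minimal sketch of how mentions linked to the same concept could be merged into one row. It is not the implementation of `link_ontology`: the function `merge_by_concept` and both toy dictionaries are hypothetical, and we assume that unlinked entities simply keep their own name.

```python
# Hypothetical linking data: mention -> ontology concept.
linking = {
    "ace2": "angiotensin-converting enzyme 2",
    "ace-2": "angiotensin-converting enzyme 2",
    "paracetamol": "acetaminophen",
}

# Hypothetical occurrence data: entity -> set of papers mentioning it.
papers_by_entity = {
    "ace2": {"214924", "184360"},
    "ace-2": {"184360", "197804"},
    "paracetamol": {"197804"},
    "acetaminophen": {"179426"},
}


def merge_by_concept(occurrences, mention_to_concept):
    """Merge the paper sets of all mentions linked to the same concept."""
    merged = {}
    for entity, papers in occurrences.items():
        # Assumption: an entity without a link is kept under its own name.
        concept = mention_to_concept.get(entity, entity)
        record = merged.setdefault(
            concept, {"paper": set(), "aggregated_entities": []})
        record["paper"] |= papers
        record["aggregated_entities"].append(entity)
    for record in merged.values():
        record["paper_frequency"] = len(record["paper"])
    return merged


linked = merge_by_concept(papers_by_entity, linking)
print(linked["angiotensin-converting enzyme 2"]["paper_frequency"])
```

Note that the paper frequency of a merged row is the size of the union of the paper sets, not the sum of the individual frequencies.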


### Interactive curation of entity occurrence data

The package provides an interactive entity curation app that allows the user to visualize the entity occurrence data, modify it, perform ontology linking (see the `Link to NCIT ontology` button), and filter out short or infrequent entities.

The field `Keep` allows specifying a set of entities that must be kept in the dataset at all times (even if they don't satisfy the selected filtering criteria).

Finally, the value specified in the `Generate Graphs from top 500 frequent entities` field sets the number of most frequent entities (by the number of papers in which they occur) to be included in the co-occurrence network.
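
The interplay between the filters, the `Keep` field and the top-N selection can be sketched as follows. This is a hedged illustration only: `select_entities`, its parameters and the toy frequencies are hypothetical, and we assume here that kept entities are added on top of the N most frequent ones, which may differ from what the app actually does.

```python
def select_entities(paper_frequency, keep=(), min_length=3,
                    min_frequency=2, top_n=500):
    """Select entities for the network from a {entity: paper_frequency} dict."""
    keep = set(keep)
    # Entities passing the (hypothetical) length and frequency filters.
    passed = [
        entity for entity, freq in paper_frequency.items()
        if entity not in keep
        and len(entity) >= min_length
        and freq >= min_frequency
    ]
    # Most frequent first; ties are broken alphabetically.
    passed.sort(key=lambda entity: (-paper_frequency[entity], entity))
    # Kept entities bypass the filters entirely.
    kept = sorted(entity for entity in keep if entity in paper_frequency)
    return kept + passed[:top_n]


frequencies = {
    "glucose": 1, "covid-19": 20, "ace2": 9, "hb": 7, "mice": 5, "basal": 1,
}
selected = select_entities(
    frequencies, keep=["glucose", "covid-19"], top_n=1)
print(selected)
```

Here `glucose` survives despite its low frequency because it is in the keep list, `hb` is dropped as too short, and `ace2` is the single most frequent entity among those that pass the filters.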

We load the prepared data table into the curation app as follows:

In [10]:
curation_app.set_table(curation_input_table.copy())

We can specify the default entities to keep.

In [11]:
default_entities_to_keep = ["glucose", "covid-19"]
curation_app.set_default_terms_to_include(default_entities_to_keep)

Finally, we set the ontology linking callback to be fired upon a click on the `Link to NCIT ontology` button.

In [12]:
curation_app.set_ontology_linking_callback(lambda x: link_ontology(ontology_linking, type_mapping, x))

### Launch the curation app

The application can be launched inline (inside the current notebook), as below.

In [13]:
curation_app.run(port=8072, mode="inline")

Merging the occurrence data with the ontology linking...


Alternatively, it can be opened externally, at a URL that you can open in a separate tab of your browser (try uncommenting and executing the cell below, then Ctrl+Click on the displayed URL).

In [14]:
# curation_app.run(port=8070, mode="external")

## 3. Co-occurrence network generation

The current curation table displayed in the curation app can be extracted using the `get_curated_table` method.

In [15]:
curated_occurrence_data = curation_app.get_curated_table()
curated_occurrence_data.head(5)

entity,paper,section,paragraph,aggregated_entities,uid,definition,paper_frequency,entity_type
4e-bp1,{214924},{214924:The Interplay Between Covid-19 And Amp...,{214924:The Interplay Between Covid-19 And Amp...,[4e-bp1],,,1,PROTEIN
ace inhibitor,"{197804, 184360}","{184360:Aceis And Arbs , 197804:Management Of ...",{197804:Management Of Children And Young Peopl...,[acei],http://purl.obolibrary.org/obo/NCIT_C247,Any substance that inhibits angiotensin-conver...,2,DRUG
acetaminophen,"{179426, 197804}",{197804:Management Of Children And Young Peopl...,"{197804:Discussion:52, 197804:Management Of Ch...","[acetaminophen, paracetamol]",http://purl.obolibrary.org/obo/NCIT_C198,A p-aminophenol derivative with analgesic and ...,2,DRUG
acute lung injury,"{179426, 214924, 197804, 184360}","{214924:The Immune Response To Sars-Cov-2 , 18...",{179426:Role Of Ace/Arbs ::: Special Aspects O...,"[ali, lung injury]",http://purl.obolibrary.org/obo/NCIT_C155766,A finding of acute onset of bilateral pulmonar...,4,DISEASE
acute respiratory distress syndrome,"{214924, 179426, 129074, 184360, 211125, 197804}","{214924:Introduction, 214924:The Immune Respon...","{214924:Introduction:4, 211125:Introduction:3,...","[acute respiratory distress syndrome, ards]",http://purl.obolibrary.org/obo/NCIT_C3353,Progressive and life-threatening pulmonary dis...,6,DISEASE


We can also retrieve the current values of the `Keep` field (these entities will also be included in the resulting co-occurrence network).

In [16]:
curation_app.get_terms_to_include()

['glucose', 'covid-19']

We can also retrieve the number of most frequent entities to use for network generation.

In [17]:
curation_app.n_most_frequent

500

### Generating co-occurrence networks

In the cell below we generate two co-occurrence networks: one for paper-based and one for paragraph-based entity co-occurrences. Along with the network generation, the `generate_cooccurrence_analysis` function
- computes node centrality metrics (such as degree and PageRank);
- computes co-occurrence statistics (such as frequency, pointwise mutual information and normalized pointwise mutual information) and assigns them as weights to the corresponding edges;
- performs entity community detection based on different co-occurrence statistics;
- computes mutual-information-based minimum spanning trees.
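
For reference, the two mutual-information scores mentioned above have standard definitions: PMI(x, y) = log(p(x, y) / (p(x) p(y))) and NPMI(x, y) = PMI(x, y) / (-log p(x, y)). The sketch below computes them from occurrence counts. It is not the package's code: the functions `pmi` and `npmi` are hypothetical, the base-2 logarithm is an arbitrary choice, and we assume that probabilities are estimated as counts divided by the corresponding factor count (e.g. the 20 papers in `factor_counts`).

```python
import math


def pmi(n_x, n_y, n_xy, total):
    """Pointwise mutual information from occurrence counts.

    n_x, n_y: number of units (e.g. papers) mentioning x and y respectively;
    n_xy: number of units mentioning both; total: number of units overall.
    """
    if n_xy == 0:
        raise ValueError("PMI is undefined for entities that never co-occur")
    p_x, p_y, p_xy = n_x / total, n_y / total, n_xy / total
    return math.log2(p_xy / (p_x * p_y))


def npmi(n_x, n_y, n_xy, total):
    """Normalized PMI: 0 for independence, 1 for perfect co-occurrence."""
    if n_xy == total:
        # Both entities occur in every unit: the ratio is 0 / 0, so we use
        # the limit value for perfectly co-occurring entities.
        return 1.0
    p_xy = n_xy / total
    return pmi(n_x, n_y, n_xy, total) / -math.log2(p_xy)


# Toy counts: x occurs in 4 of 20 papers, y in 5, and they co-occur in 4.
print(pmi(4, 5, 4, 20), npmi(4, 5, 4, 20))
```

A practical reason to prefer the normalized version as an edge weight is that it is bounded, which makes scores of rare and frequent entity pairs easier to compare.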

In [18]:
%%time
type_data = curated_occurrence_data[["entity_type"]].rename(columns={"entity_type": "type"})

graphs, trees = generate_cooccurrence_analysis(
    curated_occurrence_data,  factor_counts,
    n_most_frequent=curation_app.n_most_frequent if curation_app.n_most_frequent else 100,
    type_data=type_data, 
    factors=["paper", "paragraph"],
    keep=curation_app.get_terms_to_include(),
    cores=8, backend="networkx")

print("Done.")

-------------------------------
Factor: paper
-------------------------------
Examining 23005 pairs of terms for co-occurrence...
-------------------------------
Factor: paragraph
-------------------------------
Examining 23005 pairs of terms for co-occurrence...
Done.
CPU times: user 2.5 s, sys: 369 ms, total: 2.86 s
Wall time: 12.7 s
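
The number of examined pairs reported in the log is consistent with every unordered pair of entities being examined once per factor. As a quick arithmetic check (this is our reading of the log, not something stated by the package), we can recover the number of entities from the pair count:

```python
import math

n_pairs = 23005  # pair count reported in the log above

# Solve n * (n - 1) / 2 = n_pairs for the number of entities n.
n_entities = (1 + math.isqrt(1 + 8 * n_pairs)) // 2

# Sanity check: the number of unordered pairs of n_entities items.
assert math.comb(n_entities, 2) == n_pairs
print(n_entities)
```

Under this reading, 215 entities entered the network generation, which is below the limit of 500 set in the curation app.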


## 4. Network visualization and analysis

### Loading the generated graphs into the visualization app

In [19]:
visualization_app.set_graph(
    "Paper-based graph", graphs["paper"],
    tree=trees["paper"], default_top_n=100)

visualization_app.set_graph(
    "Paragraph-based graph", graphs["paragraph"],
    tree=trees["paragraph"], default_top_n=100)

visualization_app.set_current_graph("Paragraph-based graph")

### Loading papers' meta-data into the app

We now load an additional dataset containing some meta-data on the papers where the entities analyzed in this notebook occur.

In [20]:
paper_data = pd.read_csv("../data/Glucose_risk_3000_paper_meta_data.csv")
paper_data = paper_data.set_index("id")
paper_data.head(3)

id,title,authors,abstract,doi,url,journal,pmc_id,pubmed_id,publish_time
3,Surfactant protein-D and pulmonary host defense,"Crouch, Erika C",Surfactant protein-D (SP-D) participates in th...,10.1186/rr19,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5...,Respir Res,PMC59549,11667972.0,2000-08-25
56,CLINICAL VIGNETTES,,,10.1046/j.1525-1497.18.s1.20.x,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1...,J Gen Intern Med,PMC1494988,12753119.0,2003-04-01
58,Clinical Vignettes,,,10.1046/j.1525-1497.2001.0160s1023.x,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1...,J Gen Intern Med,PMC1495316,11357836.0,2001-04-01


We pass a callback for the lookup of paper meta-data to the visualization app using the `set_list_papers_callback` method.

In [21]:
def list_papers(paper_data, selected_papers, limit=200):
    """Return meta-data records for the selected papers, at most `limit` of them."""
    selected_paper_data = paper_data.loc[[int(p) for p in selected_papers]].head(limit)
    return selected_paper_data.to_dict("records")

visualization_app.set_list_papers_callback(lambda x: list_papers(paper_data, x))

The ontology linking process described above is noisy; therefore, we would like to keep access to the raw entities that were linked to particular ontology concepts. For this we define the function `get_aggregated_entities`, which retrieves such raw entities, and pass it to the visualization app using the `set_aggregated_entities_callback` method.

In [22]:
def top_n(data_dict, n, smallest=False):
    """Return top `n` keys of the input dictionary by their value."""
    df = pd.DataFrame(dict(data_dict).items(), columns=["id", "value"])
    if smallest:
        df = df.nsmallest(n, columns=["value"])
    else:
        df = df.nlargest(n, columns=["value"])
    return list(df["id"])


def get_aggregated_entities(entity, n):
    if "aggregated_entities" in curated_occurrence_data.columns:
        aggregated = curated_occurrence_data.loc[entity]["aggregated_entities"]
    else:
        aggregated = [entity]
    if curation_input_table is not None:
        df = curation_input_table.set_index("entity")
        if entity in curated_occurrence_data.index:
            freqs = df.loc[aggregated]["paper_frequency"].to_dict()
        else:
            return {}
    else:
        df = data.copy()
        df["entity"] = data["entity"].apply(lambda x: x.lower())
        freqs = df[df["entity"].apply(lambda x: x.lower() in aggregated)].groupby("entity").aggregate(
            lambda x: len(x))["entity_type"].to_dict()
    if len(freqs) == 0:
        return {}
    return {e: freqs[e] for e in top_n(freqs, n)}

visualization_app.set_aggregated_entities_callback(
    lambda x: get_aggregated_entities(x, 10))

Finally, we create a dictionary `definitions` that will serve the visualization app as the lookup table for accessing the definitions of different ontology concepts.

In [23]:
definitions = ontology_linking[["concept", "definition"]].groupby(
    "concept").aggregate(lambda x: list(x)[0]).to_dict()["definition"]
visualization_app.set_entity_definitons(definitions)

### Launching the visualization app

As before, the interactive graph visualization app can be launched in two modes: inline and external. Here we recommend the external mode for a better user experience.

In [24]:
visualization_app.run(port=8082, mode="external")

Dash app running on http://127.0.0.1:8082/


In [25]:
graphs

{'paper': <bluegraph.core.io.PandasPGFrame at 0x7fa42a43bb00>,
 'paragraph': <bluegraph.core.io.PandasPGFrame at 0x7fa43a69b5f8>}

In [26]:
from bluegraph.backends.networkx import pgframe_to_networkx, NXPathFinder

In [27]:
from bluegraph.core.analyse.paths import graph_elements_from_paths

In [28]:
f = NXPathFinder(graphs["paper"])

In [29]:
f.graph.nodes()

NodeView(('acetaminophen', 'acute lung injury', 'acute respiratory distress syndrome', 'adipose tissue', 'angioedema', 'angiotensin ii receptor antagonist', 'angiotensin-2', 'angiotensin-converting enzyme', 'angiotensin-converting enzyme 2', 'anxiety', 'basal', 'blood', 'bradykinin', 'caax prenyl protease 2', 'cardiovascular disorder', 'cardiovascular system', 'chemokine', 'chest pain', 'child', 'chloroquine', 'chronic disease', 'chronic kidney disease', 'comorbidity', 'confounding factors', 'coronaviridae', 'coronavirus', 'cough', 'covid-19', 'death', 'degradation', 'diabetes mellitus', 'diabetic ketoacidosis', 'diarrhea, ctcae', 'dipeptidyl peptidase 4', 'dpp-4i', 'dpp4i', 'fever', 'glucose', 'glucose metabolism disorder', 'glyburide', 'glycosylated hemoglobin measurement', 'growth factor', 'h1n1', 'hcp', 'headache', 'heart', 'hmg-coa reductase inhibitor', 'humoral immunity', 'hyperglycemia', 'hypertension', 'ibuprofen', 'infectious disorder', 'inflammation', 'influenza', 'insulin', 

In [30]:
paths = f.n_shortest_paths("glucose", "inflammation", 10)
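
To illustrate what a query for the `n` shortest paths returns, here is a minimal pure-Python sketch that enumerates simple paths in order of increasing hop count. It is not how `NXPathFinder.n_shortest_paths` is implemented: the function `n_shortest_simple_paths` and the toy adjacency dictionary are hypothetical, edge weights are ignored, and the brute-force search is only suitable for small graphs.

```python
from collections import deque


def n_shortest_simple_paths(adjacency, source, target, n):
    """Return up to `n` simple paths from source to target, fewest hops first."""
    found = []
    queue = deque([(source,)])
    # Breadth-first search over partial paths: paths leave the queue in
    # order of non-decreasing length.
    while queue and len(found) < n:
        path = queue.popleft()
        last = path[-1]
        if last == target:
            found.append(path)
            continue
        for neighbour in sorted(adjacency.get(last, ())):
            if neighbour not in path:  # keep the path simple (no repeats)
                queue.append(path + (neighbour,))
    return found


# Hypothetical toy graph; the node names are borrowed from this notebook,
# the edges are made up.
toy_adjacency = {
    "glucose": {"inflammation", "heart", "insulin"},
    "heart": {"glucose", "inflammation", "insulin"},
    "insulin": {"glucose", "heart"},
    "inflammation": {"glucose", "heart"},
}

toy_paths = n_shortest_simple_paths(
    toy_adjacency, "glucose", "inflammation", 3)
for path in toy_paths:
    print(" -> ".join(path))
```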

In [47]:
def get_subgraph(graph, nodes_to_exclude=None, edges_to_exclude=None):
    """Return the subgraph of `graph` without the excluded nodes and edges."""
    if nodes_to_exclude is None:
        nodes_to_exclude = []
    nodes_to_include = [
        n for n in graph.nodes()
        if n not in nodes_to_exclude
    ]

    subgraph = graph.subgraph(nodes_to_include)

    if edges_to_exclude is not None:
        # Keep only the edges that are not explicitly excluded.
        subgraph = subgraph.edge_subgraph(
            [e for e in subgraph.edges() if e not in edges_to_exclude]
        )

    return subgraph

In [67]:
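def subgraph_from_paths(graph, paths):
    """Return the subgraph of `graph` formed by the nodes and edges on `paths`."""
    nodes, edges = graph_elements_from_paths(paths)
    subgraph = graph.subgraph(nodes).edge_subgraph(edges)
    print(edges)
    return subgraph

In [68]:
# `f.graph` is the NetworkX graph wrapped by the path finder created above.
path_graph = subgraph_from_paths(f.graph, paths)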

{('glucose', 'glycosylated hemoglobin measurement'), ('glycosylated hemoglobin measurement', 'inflammation'), ('glucose', 'growth factor'), ('headache', 'inflammation'), ('hcp', 'inflammation'), ('glucose', 'glyburide'), ('h1n1', 'inflammation'), ('glucose', 'h1n1'), ('glyburide', 'inflammation'), ('growth factor', 'inflammation'), ('glucose', 'headache'), ('glucose', 'heart failure'), ('glucose', 'hcp'), ('heart failure', 'inflammation'), ('glucose', 'inflammation'), ('glucose', 'heart'), ('heart', 'inflammation'), ('glucose metabolism disorder', 'inflammation'), ('glucose', 'glucose metabolism disorder')}
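
The printed set of edges suggests that the elements of a path are its nodes plus the pairs of consecutive nodes. A minimal sketch of this idea is given below; the helper `elements_from_paths` and the toy paths are hypothetical, and this is our guess at the behaviour of `graph_elements_from_paths`, not its actual implementation.

```python
def elements_from_paths(paths):
    """Collect the nodes and the consecutive-node edges of the given paths."""
    nodes = set()
    edges = set()
    for path in paths:
        nodes.update(path)
        # Each pair of neighbouring nodes on a path is an edge.
        edges.update(zip(path[:-1], path[1:]))
    return nodes, edges


# Hypothetical paths between two of the entities used in this notebook.
example_paths = [
    ("glucose", "inflammation"),
    ("glucose", "heart", "inflammation"),
]
path_nodes, path_edges = elements_from_paths(example_paths)
print(sorted(path_edges))
```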


In [69]:
path_graph.edges()

OutEdgeView([('growth factor', 'inflammation'), ('hcp', 'inflammation'), ('heart failure', 'inflammation'), ('heart', 'inflammation'), ('glyburide', 'inflammation'), ('h1n1', 'inflammation'), ('headache', 'inflammation'), ('glycosylated hemoglobin measurement', 'inflammation'), ('glucose', 'glucose metabolism disorder'), ('glucose', 'glyburide'), ('glucose', 'glycosylated hemoglobin measurement'), ('glucose', 'growth factor'), ('glucose', 'h1n1'), ('glucose', 'hcp'), ('glucose', 'headache'), ('glucose', 'heart'), ('glucose', 'heart failure'), ('glucose', 'inflammation'), ('glucose metabolism disorder', 'inflammation')])