## Google Colab Prerequisites and installation instructions 

If you want to run this notebook in Google Colab, run the following cell. Otherwise follow [installation instructions](https://github.com/BlueBrain/BlueGraph/blob/master/README.rst#installation) to install BlueGraph and its dependencies locally.

In [None]:
# Install bluegraph
! git clone https://github.com/BlueBrain/BlueGraph
! cd BlueGraph && pip install .[cord19kg]

# Install graph-tool
!echo "deb http://downloads.skewed.de/apt bionic main" >> /etc/apt/sources.list
!apt-key adv --keyserver keys.openpgp.org --recv-key 612DEFB798507F25
!apt-get update
!apt-get install python3-graph-tool=2.37 python3-cairo python3-matplotlib

DATA_PATH = "BlueGraph/cord19kg/examples/data/"

## Local environment Prerequisites and installation instructions

In order to run this notebook, `graph-tool` must be installed manually: it cannot be installed as a part of `pip install bluegraph`, because it is not an ordinary Python library but a wrapper around a C++ library. Please see the [graph-tool installation instructions](https://git.skewed.de/count0/graph-tool/-/wikis/installation-instructions#native-installation). Currently, BlueGraph supports `graph-tool<=2.37`.

We recommend using `conda` for installing `graph-tool`. For example:

```
conda install -c conda-forge graph-tool==2.37
```

or as a part of a new `conda` environment:

```
conda create --name <your_environment> -c conda-forge graph-tool==2.37
conda activate <your_environment>
```


BlueGraph and the set of dependencies supporting custom tools for CORD-19 analysis can be installed using:

```
pip install bluegraph[cord19kg]
```

# Topic-centered co-occurrence network analysis of BBP BlueSearch NER output

In this notebook we will perform interactive exploration and analysis of a topic-centered subset of the [CORD-19](https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge) dataset using the `cord19kg` package. The exploration and analysis techniques presented here focus on named entities and their co-occurrence in the scientific articles constituting the dataset.

The input data for this notebook contains the named entities extracted from the articles using the Named Entity Recognition (NER) techniques included in [BlueSearch](https://github.com/BlueBrain/Search).

The interactive literature exploration through named entity co-occurrence analysis consists of the following steps:

1. __Data preparation__ step converts raw mentions into aggregated entity occurrence statistics.
2. __Data curation__ step allows the user to manage extracted entities: modify, filter them and link to the ontology.
3. __Network generation__ step allows creating entity co-occurrence networks based on paper-, section- and paragraph-level co-occurrence relations between entities. These entity relations are quantified using mutual-information-based scores (pointwise mutual information and its normalized version).
4. __Network visualization and analysis__ step allows the user to perform interactive network visualization, edit network elements and perform its analysis (spanning tree, mutual-information based shortest paths between entities, etc).

In [71]:
import json
import os
import zipfile

import pandas as pd

import dash_cytoscape as cyto

from kgforge.core import KnowledgeGraphForge

from cord19kg.utils import (generate_curation_table,
                           link_ontology,
                           generate_cooccurrence_analysis,
                           download_from_nexus)
from bluegraph.backends.graph_tool import (GTMetricProcessor,
                                           GTPathFinder,
                                           GTGraphProcessor,
                                           GTCommunityDetector)
from cord19kg.apps.curation_app import curation_app
from cord19kg.apps.visualization_app import visualization_app

In [72]:
cyto.load_extra_layouts()

## 1. Data preparation

The input dataset contains occurrences of different terms in paragraphs of scientific articles from the CORD-19 dataset previously extracted by means of a NER model.
The dataset is stored in Blue Brain Nexus.

In [73]:
# Blue Brain Nexus bucket to download data from

nexus_bucket = "bbp/kg-inference"
nexus_endpoint = "https://bbp.epfl.ch/nexus/v1"
nexus_config_file = f"{DATA_PATH}../config/data-download-nexus.yml"


### Get an authentication token

For now, the [Nexus web application](https://bbp.epfl.ch/nexus/web) can be used to get a token. We are looking for other simpler alternatives.

- Step 1: From the opened web page, click on the login button on the right corner and follow the instructions.

![login-ui](./login-ui.png)

- Step 2: At the end you’ll see a token button on the right corner. Click on it to copy the token.

![copy-token](./copy-token.png)


In [74]:
import getpass

In [11]:
TOKEN = getpass.getpass()

In [75]:
from kgforge.core import KnowledgeGraphForge

forge = KnowledgeGraphForge(nexus_config_file, bucket=nexus_bucket, endpoint=nexus_endpoint, token=TOKEN)


### Download Data

In [76]:
data_version = "v1_13_10_2022"
try:
    print(f"Data path: '{DATA_PATH}'")
except NameError:
    DATA_PATH = "../data/"
    print(f"Data path: '{DATA_PATH}'")

Data path: '../data/'


In [77]:
%%time
ner_output_resource = forge.retrieve("https://bbp.epfl.ch/neurosciencegraph/data/2680c892-d1ae-4c60-8e1f-d917acb13ae8", version=data_version)
forge.download(ner_output_resource, "distribution.contentUrl", DATA_PATH)
print("Done.")

Done.
CPU times: user 56.7 s, sys: 1min 6s, total: 2min 3s
Wall time: 1min 10s


In [78]:
data = pd.read_csv(f"{DATA_PATH}/results_v1.csv")

In [79]:
data['occurrence'] = data['paper_id'].map(str) + ':_:' + data['paragraph_id'].map(str)
data = data[['entity', 'entity_type', 'occurrence', 'start', 'end']]

In [80]:
data.sample(5)

Unnamed: 0,entity,entity_type,occurrence,start,end
2379512,GroEL,GENE,b3abe505ec16bed01668f7ef00230733:_:n9KMiYMBxas...,385,390
2703541,SMA,BRAIN_REGION,cdfde5d13d2b1bd60f578a681ecde0b5:_:e9OMiYMBxas...,194,197
428471,nerve,CELL_TYPE,1baf7c706e879341f550b6e49d4c5ae0:_:D9CLiYMBxas...,132,137
1291408,CYP46A1,GENE,6f26ff7993fcb06001aaf430585ff927:_:3HWMiYMBVjZ...,1683,1690
102681,neuron,CELL_TYPE,0097295b0935e432340c19aa7290a392:_:UHOKiYMBVjZ...,423,429


In the first preparation step, we group and aggregate the input data by unique entities.

In [81]:
%%time
print("Preparing curation data...")
curation_input_table, factor_counts = generate_curation_table(data)
print("Done.")

Preparing curation data...




A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy



Cleaning up the entities...
Aggregating occurrences of entities....
Done.
CPU times: user 40.4 s, sys: 16.5 s, total: 56.9 s
Wall time: 42.1 s


The resulting dataframe contains a row per unique named entity together with the following occurrence data: 
- sets of unique paragraphs, papers, sections, where the corresponding entity is mentioned (`paper`, `section`, `paragraph` columns);
- number of total entity occurrences (the `raw_frequency` column);
- number of unique papers where it occurs (the `paper_frequency` column);
- unique entity types assigned by the NER model (the `entity_type` column, multiple types are possible);
- raw entity types assigned by the NER model with the multiplicity of their occurrence (the `raw_entity_types` column).
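
To make the aggregation concrete, here is a minimal sketch in plain Python. It is not the implementation of `generate_curation_table`: it assumes the `paper_id:_:paragraph_id` occurrence format built above and that mentions are grouped by their lowercased form. The helper `aggregate_mentions` and the toy mentions are hypothetical.

```python
from collections import defaultdict


def aggregate_mentions(mentions):
    """Group (entity, entity_type, occurrence) mentions by lowercased entity.

    Hypothetical helper for illustration; `occurrence` is assumed to have
    the form "<paper_id>:_:<paragraph_id>".
    """
    table = defaultdict(lambda: {
        "paragraph": set(), "paper": set(),
        "raw_entity_types": [], "raw_frequency": 0})
    for entity, entity_type, occurrence in mentions:
        row = table[entity.lower()]
        paper_id = occurrence.split(":_:")[0]
        row["paragraph"].add(occurrence)
        row["paper"].add(paper_id)
        row["raw_entity_types"].append(entity_type)
        row["raw_frequency"] += 1
    for row in table.values():
        # Derived columns: number of distinct papers and distinct types
        row["paper_frequency"] = len(row["paper"])
        row["entity_type"] = sorted(set(row["raw_entity_types"]))
    return dict(table)


# Toy mentions (made up for this example)
mentions = [
    ("GroEL", "GENE", "paperA:_:p1"),
    ("groel", "GENE", "paperA:_:p2"),
    ("GroEL", "PROTEIN", "paperB:_:p7"),
    ("neuron", "CELL_TYPE", "paperA:_:p1"),
]
aggregated = aggregate_mentions(mentions)
print(aggregated["groel"]["raw_frequency"], aggregated["groel"]["paper_frequency"])
```

In this toy input the three `groel` mentions collapse into one row with a raw frequency of 3 and a paper frequency of 2.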


In [82]:
curation_input_table.sample(5)

Unnamed: 0,entity,entity_type,paragraph,paper,section,paper_frequency,raw_entity_types,raw_frequency
47692,wnk,GENE,[33d54df8ad7570b3ffac0c1ffd60e84d:_:TXOLiYMBVj...,"[c01c645d9762581bfe2dc12f72fa3971, b85a85bf778...","[a20dc9b71756721449293894b673745a:_, 92e3a5749...",8,"[GENE, GENE, GENE, GENE, GENE, GENE, GENE, GEN...",39
8391,chiasmal,BRAIN_REGION,[ab1805bc029628ba8ccdd2e9b9ad211f:_:a3aMiYMBVj...,"[c1032b821f35d30411c384c9119cf693, 03bc9f56817...","[f80475bc40dc21ebdb392d16db947937:_, c1032b821...",4,"[BRAIN_REGION, BRAIN_REGION, BRAIN_REGION, BRA...",4
21565,ip _ 3r2,GENE,[14d8fe987bf0afdd1aaea03305ea8b9c:_:zHOLiYMBVj...,"[14d8fe987bf0afdd1aaea03305ea8b9c, 4db4d1198a1...","[2043fa78c1d9583b62c977b50ebd7fae:_, 176b02cdc...",12,"[GENE, GENE, GENE, GENE, GENE, GENE, GENE, GEN...",94
47169,vlps,GENE,[0f62e797793c13b1cff52bb6324f430b:_:Fs-LiYMBxa...,"[0f62e797793c13b1cff52bb6324f430b, ee35c3d19ed...","[ee35c3d19edaf5335862334657cc4432:_, 0f62e7977...",2,"[GENE, GENE]",2
25829,medullary reticular formation,BRAIN_REGION,[5f6ec871f3fe27602997e4d5a8ae307b:_:ptGLiYMBxa...,"[5aeabadb9163b09b6cd459c091c60cdd, f4b8de0a3ab...","[87b5c594adb5851d2f50e3860374f9bd:_, 9bda00f72...",13,"[BRAIN_REGION, BRAIN_REGION, BRAIN_REGION, BRA...",18


The second output of the data preparation step contains the counts of distinct instances of each occurrence factor, i.e. the number of distinct papers, sections and paragraphs in the input corpus.

In [83]:
factor_counts

{'paper': 12292, 'section': 12292, 'paragraph': 344610}

## 2. Data curation

#### Loading the NCIT ontology linking data (To be updated with BBP ontologies)

To group synonymous entities in the previously extracted table (e.g. `ace2`, `ace-2`, `angiotensin-converting enzyme 2`), as well as to assign additional semantics to these entities (e.g. human-readable definition, taxonomy, etc.), we perform further _linking_ of the entities to the terms from the [NCIT ontology](https://ncithesaurus.nci.nih.gov/ncitbrowser/).

To be able to perform such ontology linking, we load some additional (pre-computed using ML-based linking models) data.

In [84]:
%%time
print("Loading the ontology linking data...")

download_from_nexus(
    uri=f"{nexus_endpoint}/resources/covid19-kg/data/_/4fde1f8f-ee7f-435e-95a8-abb79139db93",
    output_path=DATA_PATH, config_file_path=nexus_config_file,
    nexus_endpoint=nexus_endpoint, nexus_bucket="covid19-kg/data", unzip=True)
print("\tLoading the linking dataframe in memory...")
ontology_linking = pd.read_csv(f"{DATA_PATH}/NCIT_ontology_linking_3000_papers.csv")

print("Loading ontology type mapping...")
ontology_linking_type_mapping_data = download_from_nexus(
    uri=f"{nexus_endpoint}/resources/covid19-kg/data/_/92bc2a04-6003-4f4d-85e1-dcc5f2352df2", 
    output_path=DATA_PATH, config_file_path=nexus_config_file,
    nexus_endpoint=nexus_endpoint, nexus_bucket="covid19-kg/data")
with open(f"{DATA_PATH}/{ontology_linking_type_mapping_data.distribution.name}", "rb") as f:
    type_mapping = json.load(f)
print("Done.")

Loading the ontology linking data...
Downloading the file to '../data/NCIT_ontology_linking_3000_papers.csv.zip'
Decompressing ...
	Loading the linking dataframe in memory...
Loading ontology type mapping...
Downloading the file to '../data/NCIT_type_mapping.json'
Done.
CPU times: user 43.8 s, sys: 51.7 s, total: 1min 35s
Wall time: 58.3 s


The ontology linking table contains the following columns:
- `mention` entity mentioned in the text
- `concept` ontology concept linked to the entity mention
- `uid` unique identifier of the ontology concept
- `definition` definition of the concept
- `taxonomy` taxonomy of semantic types associated with the concept
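
The effect of linking on the occurrence data can be sketched as follows. This is an illustrative plain-Python example, not the logic of `link_ontology`: it assumes that entities linked to the same concept are merged into one row whose occurrence sets are unions of the originals. The helper `link_entities`, the paper identifiers and the links are hypothetical.

```python
def link_entities(occurrences, mention_to_concept):
    """Merge entities linked to the same concept (illustrative helper).

    `occurrences` maps an entity to the set of papers mentioning it;
    `mention_to_concept` maps an entity mention to an ontology concept.
    Entities without a link are kept under their own name.
    """
    linked = {}
    for entity, papers in occurrences.items():
        concept = mention_to_concept.get(entity, entity)
        row = linked.setdefault(
            concept, {"paper": set(), "aggregated_entities": []})
        row["paper"] |= papers
        row["aggregated_entities"].append(entity)
    return linked


# Toy data: paper sets and links are made up for this example
occurrences = {
    "ace2": {"p1", "p2"},
    "ace-2": {"p2", "p3"},
    "angiotensin-converting enzyme 2": {"p4"},
    "plasma": {"p1"},
}
mention_to_concept = {
    "ace2": "angiotensin-converting enzyme 2",
    "ace-2": "angiotensin-converting enzyme 2",
}
linked = link_entities(occurrences, mention_to_concept)
print(sorted(linked))
print(len(linked["angiotensin-converting enzyme 2"]["paper"]))
```

Keeping the list of merged raw entities next to each concept makes it possible to inspect later which mentions a concept was built from.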

In [85]:
ontology_linking.sample(5)

Unnamed: 0,mention,concept,uid,definition,taxonomy
67428,type ii respiratory failure,"respiratory failure, ctcae",http://purl.obolibrary.org/obo/NCIT_C143809,A disorder characterized by impaired gas excha...,[('http://purl.obolibrary.org/obo/NCIT_C143181...
71989,a-lipoic acid,alpha-lipoic acid,http://purl.obolibrary.org/obo/NCIT_C61595,"A naturally occurring micronutrient, synthesiz...","[('http://purl.obolibrary.org/obo/NCIT_C1505',..."
87702,buffered water,buffered,http://purl.obolibrary.org/obo/NCIT_C63343,The property of being able to chemically neutr...,[('http://purl.obolibrary.org/obo/NCIT_C27993'...
141068,nonphosphorus,phosphate measurement,http://purl.obolibrary.org/obo/NCIT_C64857,A quantitative measurement of the amount of ph...,[('http://purl.obolibrary.org/obo/NCIT_C74946'...
86642,amino acid insert,amino acid,http://purl.obolibrary.org/obo/NCIT_C231,Any organic compounds containing amino (-NH2) ...,"[('http://purl.obolibrary.org/obo/NCIT_C232', ..."


#### Interactive curation of entity occurrence data

The package provides an interactive entity curation app that allows the user to visualize the entity occurrence data, modify it, perform ontology linking (see the `Link to NCIT ontology` button), and filter out short or infrequent entities.

The field `Keep` allows specifying a set of entities that must be kept in the dataset at all times (even if they don't satisfy the selected filtering criteria).

Finally the value specified in the `Generate Graphs from top N frequent entities` field corresponds to the number of top entities (by the frequency of their occurrence in papers) to be included in the co-occurrence network.
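
How the `Keep` field could interact with the top-N selection can be sketched in plain Python. This is an assumption about the selection logic made for illustration, not the app's code; `select_entities` and the frequencies below are hypothetical.

```python
def select_entities(paper_frequency, n, keep=()):
    """Return the `n` most frequent entities plus any entities in `keep`."""
    ranked = sorted(paper_frequency, key=paper_frequency.get, reverse=True)
    selected = ranked[:n]
    for entity in keep:
        # Entities to keep are added even if they fall outside the top n
        if entity in paper_frequency and entity not in selected:
            selected.append(entity)
    return selected


# Toy paper frequencies (made up for this example)
paper_frequency = {"plasma": 120, "neuron": 95, "lox": 53, "wnk": 8}
print(select_entities(paper_frequency, 2, keep=["wnk"]))
```

Here `wnk` is retained although it is the least frequent entity, because it was explicitly listed in `keep`.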

We load the prepared data table into the curation app as follows:

In [86]:
curation_app.set_table(curation_input_table.copy())

We can specify the default entities to keep.

In [87]:
default_entities_to_keep = []
curation_app.set_default_terms_to_include(default_entities_to_keep)

Finally, we set the ontology linking callback to be fired upon a click on the `Link to NCIT ontology` button.

In [88]:
curation_app.set_ontology_linking_callback(lambda x: link_ontology(ontology_linking, type_mapping, x))

##### Launch the curation app

The application can be launched either inline (inside the current notebook) or externally, depending on the `mode` argument passed to `run`; the cell below launches it in external mode. Note that if you run this notebook in Colab, you may want to set a lower number of entities to include (the `Generate Graphs from top N frequent entities` field) in order to avoid long generation times. The current default value is 200.

In [89]:
curation_app.run(port=8074, mode="external")

Opening port number 8074 failed: Address 'http://127.0.0.1:8074' already in use.
    Try passing a different port to run_server.. Trying port number 8075 ...



The 'environ['werkzeug.server.shutdown']' function is deprecated and will be removed in Werkzeug 2.1.



Dash app running on http://127.0.0.1:8075/


Exception in thread Thread-1167:
Traceback (most recent call last):
  File "/Users/mfsy/anaconda3/envs/bluegraph/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/Users/mfsy/anaconda3/envs/bluegraph/lib/python3.7/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/mfsy/anaconda3/envs/bluegraph/lib/python3.7/site-packages/retrying.py", line 49, in wrapped_f
    return Retrying(*dargs, **dkw).call(f, *args, **kw)
  File "/Users/mfsy/anaconda3/envs/bluegraph/lib/python3.7/site-packages/retrying.py", line 212, in call
    raise attempt.get()
  File "/Users/mfsy/anaconda3/envs/bluegraph/lib/python3.7/site-packages/retrying.py", line 247, in get
    six.reraise(self.value[0], self.value[1], self.value[2])
  File "/Users/mfsy/anaconda3/envs/bluegraph/lib/python3.7/site-packages/six.py", line 703, in reraise
    raise value
  File "/Users/mfsy/anaconda3/envs/bluegraph/lib/python3.7/site-packages/retrying.py", line 200, i

Merging the occurrence data with the ontology linking...



Dropping invalid columns in SeriesGroupBy.agg is deprecated. In a future version, a TypeError will be raised. Before calling .agg, select only columns which should be valid for the aggregating function.



Merging the occurrence data with the ontology linking...


Traceback (most recent call last):
  File "/Users/mfsy/anaconda3/envs/bluegraph/lib/python3.7/site-packages/cord19kg/apps/curation_app.py", line 638, in update_output
    if data_row["aggregated_entities"] ==\
KeyError: 'aggregated_entities'


Merging the occurrence data with the ontology linking...
Merging the occurrence data with the ontology linking...


Traceback (most recent call last):
  File "/Users/mfsy/anaconda3/envs/bluegraph/lib/python3.7/site-packages/cord19kg/apps/curation_app.py", line 720, in update_output
    dff = dff.loc[dff[col_name].str.contains(filter_value)]
  File "/Users/mfsy/anaconda3/envs/bluegraph/lib/python3.7/site-packages/pandas/core/strings/accessor.py", line 116, in wrapper
    return func(self, *args, **kwargs)
  File "/Users/mfsy/anaconda3/envs/bluegraph/lib/python3.7/site-packages/pandas/core/strings/accessor.py", line 1153, in contains
    if regex and re.compile(pat).groups:
  File "/Users/mfsy/anaconda3/envs/bluegraph/lib/python3.7/re.py", line 236, in compile
    return _compile(pattern, flags)
  File "/Users/mfsy/anaconda3/envs/bluegraph/lib/python3.7/re.py", line 288, in _compile
    p = sre_compile.compile(pattern, flags)
  File "/Users/mfsy/anaconda3/envs/bluegraph/lib/python3.7/sre_compile.py", line 764, in compile
    p = sre_parse.parse(p, flags)
  File "/Users/mfsy/anaconda3/envs/bluegraph/li

Merging the occurrence data with the ontology linking...
Merging the occurrence data with the ontology linking...


In external mode, the app is served at a URL that you can open in a separate tab of your browser (Ctrl+Click on the displayed URL). To launch it this way on another port, uncomment and execute the cell below.

In [None]:
# curation_app.run(port=8070, mode="external")

## 3. Co-occurrence network generation

The curation table currently displayed in the curation app can be extracted using the `get_curated_table` method.

In [90]:
curated_occurrence_data = curation_app.get_curated_table()
curated_occurrence_data.sample(3)

Unnamed: 0_level_0,paper,section,paragraph,aggregated_entities,uid,definition,paper_frequency,entity_type
entity,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
pde4d2,"{874bab5504dc9322a9e0126b74df2eb7, 7a78b7030aa...","{874bab5504dc9322a9e0126b74df2eb7:_, 7a78b7030...",{7a78b7030aa13d9eb00e5404806a478e:_:7dGMiYMBxa...,[pde4d2],,,2,GENE
it -15,"{6d77e9eee0125813b9d09283038edb63, 9aa7e25db35...","{9aa7e25db35f38934da5bfa5fb22fa95:_, 6d77e9eee...",{6d77e9eee0125813b9d09283038edb63:_:SnWMiYMBVj...,[it -15],,,2,GENE
lox,"{43fac8bfea63c4a275ac324ce58ed17a, 1a31f09b43c...","{1573e342193b4fb2ffbb5d9b1d6a5989:_, 487b08b86...",{4fe1abeeaa80dcfe36b3a47a58cce14c:_:oXSLiYMBVj...,"[lox, loxl2, loxs]",,,53,GENE


Before we can proceed, we need to convert the `paper`, `section` and `paragraph` columns into sets.

In [91]:
curated_occurrence_data["paper"] = curated_occurrence_data["paper"].apply(set)
curated_occurrence_data["paragraph"] = curated_occurrence_data["paragraph"].apply(set)
curated_occurrence_data["section"] = curated_occurrence_data["section"].apply(set)

We can also retrieve the current values of the `Keep` field (these entities will also be included in the resulting co-occurrence network).

In [92]:
curation_app.get_terms_to_include()

[]

In [50]:
# Save curation table 
curated_occurrence_data.to_csv("./curated_linked_table.csv")

### Generating co-occurrence networks

In this section, we generate a paragraph-based entity co-occurrence network (i.e. an edge between a pair of entities is added if they co-occur in the same paragraph). In addition, the analysis generates several graph and co-occurrence statistics. It:

* computes co-occurrence statistics (such as frequency, positive pointwise mutual information and normalized pointwise mutual information) and assigns them as weights to the corresponding edges

    * `ppmi`: given two entities (x, y), their _positive pointwise mutual information (PPMI)_ is defined as $PPMI(x, y) = \max\left(0, \log_2{\frac{p(x, y)}{p(x)p(y)}}\right)$ if $p(x) \neq 0$ and $p(y) \neq 0$, and $PPMI(x, y) = 0$ otherwise. Negative values of the pointwise mutual information are therefore clipped to zero.

    * `npmi`: given two entities (x, y), their _normalized pointwise mutual information (NPMI)_ is defined as $NPMI(x, y) = \frac{PPMI(x, y)}{-\log_2{p(x, y)}}$.

* computes node centrality metrics (such as degree, PageRank) and assigns them as weights to the nodes
* performs entity community detection based on different co-occurrence statistics
* computes mutual-information-based minimum spanning trees.
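
The two mutual-information scores can be computed from raw counts as in the sketch below. It is an illustration, not BlueGraph's implementation: probabilities are estimated as counts divided by the total number of paragraphs, and negative PMI values are clipped to zero, as the name "positive" PMI implies and as the zero-valued `ppmi` entries for co-occurring pairs in the edge table of this notebook suggest. The function names and the convention for $p(x, y) = 1$ are assumptions.

```python
import math


def ppmi(n_xy, n_x, n_y, n_total):
    """Positive PMI from counts; probabilities are count / n_total."""
    if n_xy == 0 or n_x == 0 or n_y == 0:
        return 0.0
    p_xy, p_x, p_y = n_xy / n_total, n_x / n_total, n_y / n_total
    return max(0.0, math.log2(p_xy / (p_x * p_y)))


def npmi(n_xy, n_x, n_y, n_total):
    """PPMI divided by -log2 p(x, y)."""
    if n_xy == 0 or n_xy == n_total:
        # p(x, y) = 0 gives no edge; p(x, y) = 1 makes the normalizer
        # zero, so 0.0 is returned by convention in this sketch
        return 0.0
    return ppmi(n_xy, n_x, n_y, n_total) / -math.log2(n_xy / n_total)


# 8 paragraphs; x and y each occur in 2 of them and share 1
print(ppmi(1, 2, 2, 8), npmi(1, 2, 2, 8))
# x and y each occur in 4 paragraphs but share only 1: PMI is negative
print(ppmi(1, 4, 4, 8))
```

In the first call the pair co-occurs twice as often as expected under independence, so the PMI is $\log_2 2 = 1$ and the NPMI is $1/3$; in the second call the pair co-occurs less often than expected and the score is clipped to 0.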

Before we run the co-occurrence analysis, we will create a dictionary with backend configurations for the analytics:

* we set metrics (centralities) computation to use `graph_tool`,
* community detection to use `networkx` 
* and, finally, path search to use `graph_tool` as well.

In [93]:
import time

In [94]:
backend_configs = {
    "metrics": "graph_tool",
    "communities": "networkx",
    "paths": "graph_tool"
}

In [95]:
%%time
type_data = curated_occurrence_data[["entity_type"]].rename(columns={"entity_type": "type"})

graphs, trees = generate_cooccurrence_analysis(
    curated_occurrence_data,  factor_counts,
    n_most_frequent=curation_app.n_most_frequent,
    type_data=type_data, 
    factors=["paragraph"],
    keep=curation_app.get_terms_to_include(),
    cores=8,  # here set up the number of cores
    backend_configs=backend_configs)
print("Done.")

-------------------------------
Factor: paragraph
-------------------------------
Done.
CPU times: user 4min 28s, sys: 8min 7s, total: 12min 36s
Wall time: 5min 43s


In [96]:
graphs["paragraph"].edges(raw_frame=True).sample(5)

Unnamed: 0_level_0,Unnamed: 1_level_0,frequency,ppmi,npmi,distance_npmi
@source_id,@target_id,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
ampar,neuronal cells,6,0.190739,0.012065,82.886474
bcl -2,cr1,1,0.0,0.0,inf
gabaergic,raas,2,0.0,0.0,inf
plasma,stat3 measurement,15,0.0,0.0,inf
auditory cortex,hpa,1,0.0,0.0,inf
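
The `distance_npmi` values in the table above are consistent with taking the reciprocal of `npmi` (for example, 1 / 0.012065 is approximately 82.88), with `inf` where the score is zero. The sketch below assumes this reciprocal relation; the helper `npmi_distance` is hypothetical and is not taken from BlueGraph.

```python
import math


def npmi_distance(npmi):
    """Reciprocal of NPMI; a zero score maps to an infinite distance."""
    return 1.0 / npmi if npmi > 0 else math.inf


print(npmi_distance(0.012065))
print(npmi_distance(0.0))
```

Under this assumption, strongly associated entities are close to each other, while pairs with a zero score are never preferred by distance-based procedures such as shortest-path search.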


In [97]:
graphs["paragraph"].nodes(raw_frame=True).sample(5)

Unnamed: 0_level_0,@type,paragraph_frequency,entity_type,paper,degree_frequency,pagerank_frequency,community_frequency,community_npmi
@id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
radiation arc,Entity,578,GENE,"[1d96334d3820341f5f9473701606cd4e, c70139d0138...",3829.0,0.000878,1,1
interleukin-19,Entity,2008,PROTEIN,"[e90c3d9010f8ebdf279d5f7f504f5872, 6d2c3d6f2b5...",10557.0,0.001804,1,1
plasma,Entity,7378,ORGAN,"[23ba5fbabbdb2eb72f86904f80a0b21a, 103ec4ece0c...",22777.0,0.003754,4,3
gamma-aminobutyric acid,Entity,4728,CHEMICAL,"[732c9daf5a1b250e48dbf401a859d60f, 83a05bb9dc3...",27135.0,0.004232,1,1
cornea,Entity,462,BRAIN_REGION,"[9c8111f75db4249dc17b2eff623444eb, c7c3767367b...",1125.0,0.000466,1,4


#### Save or load co-occurrence graph and spanning tree

In [56]:
co_occurence_graph_nodes_path = "./co-occurence-graph-nodes.csv"
co_occurence_graph_edges_path = "./co-occurence-graph-edges.csv"

co_occurence_tree_nodes_path = "./co-occurence-tree-nodes.csv"
co_occurence_tree_edges_path = "./co-occurence-tree-edges.csv"


In [57]:
# Save co-occurrence graph
graphs["paragraph"].to_csv(co_occurence_graph_nodes_path, co_occurence_graph_edges_path)

In [58]:
# Save spanning tree
trees["paragraph"].to_csv(co_occurence_tree_nodes_path, co_occurence_tree_edges_path)

In [None]:
# Load saved co-occurrence graph and spanning tree
# NOTE: `PGFrame` is not imported in the cells above; import it from BlueGraph before running this cell
graphs = {}
trees = {}
graphs["paragraph"] = PGFrame.from_csv(co_occurence_graph_nodes_path, co_occurence_graph_edges_path)
trees["paragraph"] = PGFrame.from_csv(co_occurence_tree_nodes_path, co_occurence_tree_edges_path)

## 4. Co-occurrence network visualization and analysis

### Programmatic analysis of the co-occurrence network 

#### Nearest neighbors by co-occurrence scores

To illustrate the importance of computing mutual-information-based scores over raw frequencies, consider the following example, where we would like to find the closest (most related) neighbors of a specific term.

To do so, we will use the paragraph-based network and the raw co-occurrence frequency as the weight of our co-occurrence relation. The `top_neighbors` method of the `PathFinder` interface provided by BlueGraph allows us to search for the top neighbors with the highest edge weight. In this example, we use the `graph_tool`-based `GTPathFinder` interface.
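
Conceptually, such a search ranks the edges incident to a node by the chosen weight. The following plain-Python sketch illustrates the idea on a toy edge list; it is not the `GTPathFinder` implementation, and the function `top_neighbors_sketch`, the edge representation and the frequencies are hypothetical. Self-loops are assumed to be absent.

```python
def top_neighbors_sketch(edges, node, n, weight):
    """Return the `n` neighbors of `node` with the largest `weight` value.

    `edges` maps an unordered pair of nodes (a frozenset) to a dict of
    edge properties.
    """
    scores = {}
    for pair, props in edges.items():
        if node in pair:
            (other,) = pair - {node}
            scores[other] = props[weight]
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return dict(ranked[:n])


# Toy co-occurrence frequencies (made up for this example)
edges = {
    frozenset({"hippocampus", "neuron"}): {"frequency": 40},
    frozenset({"hippocampus", "plasma"}): {"frequency": 25},
    frozenset({"hippocampus", "dentate gyrus"}): {"frequency": 12},
    frozenset({"neuron", "plasma"}): {"frequency": 60},
}
print(top_neighbors_sketch(edges, "hippocampus", 2, weight="frequency"))
```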

In [None]:
paragraph_path_finder = GTPathFinder(graphs["paragraph"], directed=False)

In [None]:
paragraph_path_finder.top_neighbors("hippocampus", 10, weight="frequency")

In [None]:
paragraph_path_finder.top_neighbors("neuron", 10, weight="frequency")

Closer inspection of the distribution of weighted node degrees suggests that the network contains _hubs_, i.e. nodes with significantly higher connectivity to other nodes.

In [None]:
graphs["paragraph"]._nodes

In [None]:
graphs["paragraph"].nodes(raw_frame=True).nlargest(10, columns=["paragraph_frequency"])

To account for the presence of such hubs, we use the mutual-information-based scores presented above. They 'balance' the influence of the highly connected hub.
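
The following toy calculation shows how a hub can dominate a frequency-based ranking while a more specific term wins under NPMI. All counts are made up, and the `npmi` helper is a sketch that assumes PMI values below zero are clipped to zero and that every pair co-occurs in at least one but not all paragraphs.

```python
import math


def npmi(n_xy, n_x, n_y, n_total):
    """NPMI from counts, with negative PMI clipped to zero (assumed)."""
    pmi = math.log2(n_xy * n_total / (n_x * n_y))
    return max(0.0, pmi) / -math.log2(n_xy / n_total)


n_total = 1000   # number of paragraphs (toy value)
n_target = 50    # paragraphs mentioning the target term
# neighbor -> (paragraphs mentioning it, paragraphs shared with the target)
neighbors = {"plasma": (500, 25), "dentate gyrus": (20, 12)}

by_frequency = max(neighbors, key=lambda k: neighbors[k][1])
by_npmi = max(
    neighbors,
    key=lambda k: npmi(neighbors[k][1], n_target, neighbors[k][0], n_total))
print(by_frequency, "|", by_npmi)
```

The hub `plasma` shares more paragraphs with the target, but exactly as many as expected by chance (50 × 500 / 1000 = 25), so its NPMI is 0; the rarer term shares fewer paragraphs but far more than chance would predict.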

In [None]:
paragraph_path_finder.top_neighbors("hippocampus", 10, weight="npmi")

In [None]:
paragraph_path_finder.top_neighbors("neuron", 10, weight="npmi")

#### Graph metrics and centrality measures

BlueGraph provides the `MetricProcessor` interface for computing various graph statistics. As in the previous example, we will use `graph_tool`-based `GTMetricProcessor` interface.

In [None]:
#paper_metrics = GTMetricProcessor(paper_network, directed=False)
paragraph_metrics = GTMetricProcessor(graphs["paragraph"], directed=False)

##### Graph density

Density of a graph is quantified by the proportion of all possible edges ($n(n-1) / 2$ for the undirected graph with $n$ nodes) that are realized.
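
As a quick worked example of this definition (a standalone sketch, independent of `GTMetricProcessor`; the function name is ours):

```python
def density(n_nodes, n_edges):
    """Fraction of the n(n-1)/2 possible undirected edges that are present."""
    possible = n_nodes * (n_nodes - 1) / 2
    return n_edges / possible if possible else 0.0


# 5 nodes allow 10 undirected edges; 4 of them are realized
print(density(5, 4))
```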

In [None]:
#print("Density of the paper-based network: ", paper_metrics.density())
print("Density of the paragraph-based network: ", paragraph_metrics.density())

The results above show that in the paragraph network, 39% of all possible term pairs co-occur at least once.

##### Node centrality (importance) measures

In this example we will compute the Degree and PageRank centralities only for the raw frequency, and the Betweenness centrality for the mutual-information-based scores. We will use methods provided by the `MetricProcessor` interface in the _write_ mode, i.e. computed metrics will be written as node properties of the underlying graph object.

_Degree centrality_ is given by the sum of weights of all incident edges of the given node and characterizes the importance of the node in the network in terms of its connectivity to other nodes (high degree = high connectivity).
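
The weighted degree described above can be computed by hand on a small graph. The sketch below uses plain Python and a made-up edge list; it only illustrates the definition and is not how `GTMetricProcessor` computes it.

```python
from collections import defaultdict


def weighted_degree(edges):
    """Sum the weights of all edges incident to each node."""
    degree = defaultdict(float)
    for (source, target), weight in edges.items():
        degree[source] += weight
        degree[target] += weight
    return dict(degree)


# Toy undirected edges with co-occurrence frequencies as weights
edges = {("a", "b"): 3, ("a", "c"): 2, ("b", "c"): 1}
print(weighted_degree(edges))
```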

In [None]:
paragraph_metrics.degree_centrality("frequency", write=True, write_property="degree")

_PageRank centrality_ is another measure that estimates the importance of the given node in the network. Roughly speaking, it can be interpreted as the probability that, having landed on a random node in the network, we will jump to the given node (here the edge weights are taken into account). See [PageRank](https://en.wikipedia.org/wiki/PageRank) for details.
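
The random-jump interpretation can be made concrete with a small power-iteration sketch for a weighted undirected graph. This is a textbook formulation, not the `graph_tool` implementation; the damping factor of 0.85 and the number of iterations are assumptions, and every node is assumed to have at least one edge with a positive weight.

```python
def pagerank(edges, damping=0.85, n_iter=100):
    """Weighted PageRank by power iteration on an undirected graph."""
    neighbors = {}
    for (u, v), w in edges.items():
        neighbors.setdefault(u, {})[v] = w
        neighbors.setdefault(v, {})[u] = w
    nodes = list(neighbors)
    n = len(nodes)
    # Total weight of the edges incident to each node
    strength = {u: sum(neighbors[u].values()) for u in nodes}
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(n_iter):
        new_rank = {}
        for u in nodes:
            # A walker at v moves to u with probability w(u, v) / strength(v)
            incoming = sum(
                rank[v] * w / strength[v] for v, w in neighbors[u].items())
            new_rank[u] = (1 - damping) / n + damping * incoming
        rank = new_rank
    return rank


# Toy star graph: "a" is connected to every other node
edges = {("a", "b"): 1, ("a", "c"): 1, ("a", "d"): 1}
ranks = pagerank(edges)
print(max(ranks, key=ranks.get))
```

The central node of the star receives the highest score, and the scores sum to 1 because each iteration redistributes the same total probability mass.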

In [None]:
paragraph_metrics.pagerank_centrality("frequency", write=True, write_property="pagerank")

We then compute the betweenness centrality based on the NPMI distances.

_Betweenness centrality_ is a node importance measure that estimates how often a shortest path between a pair of nodes will pass through the given node.
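
For intuition, betweenness can be computed directly from this definition on a tiny graph. The sketch below runs Dijkstra's algorithm from every node while counting shortest paths, then sums, for each node, the fraction of shortest paths between other pairs that pass through it. It is unnormalized, assumes positive integer edge weights (exact equality is used to compare distances), and is meant as an illustration rather than a reproduction of the `graph_tool` routine.

```python
import heapq


def build_adjacency(edges):
    """Turn {(u, v): weight} into a symmetric adjacency dict."""
    adj = {}
    for (u, v), w in edges.items():
        adj.setdefault(u, {})[v] = w
        adj.setdefault(v, {})[u] = w
    return adj


def shortest_paths(adj, source):
    """Dijkstra from `source`: distances and numbers of shortest paths."""
    dist, count = {source: 0}, {source: 1}
    heap, done = [(0, source)], set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue
        done.add(u)
        for v, w in adj[u].items():
            nd = d + w
            if v not in dist or nd < dist[v]:
                dist[v], count[v] = nd, count[u]
                heapq.heappush(heap, (nd, v))
            elif nd == dist[v]:
                count[v] += count[u]
    return dist, count


def betweenness(adj):
    """Unnormalized betweenness over unordered pairs of distinct nodes."""
    nodes = list(adj)
    paths = {s: shortest_paths(adj, s) for s in nodes}
    score = {v: 0.0 for v in nodes}
    for i, s in enumerate(nodes):
        dist_s, count_s = paths[s]
        for t in nodes[i + 1:]:
            if t not in dist_s:
                continue
            for v in nodes:
                if v in (s, t) or v not in dist_s:
                    continue
                dist_v, count_v = paths[v]
                # v lies on a shortest s-t path if the distances add up
                if t in dist_v and dist_s[v] + dist_v[t] == dist_s[t]:
                    score[v] += count_s[v] * count_v[t] / count_s[t]
    return score


# Toy graph: the direct a-c edge is long, so a reaches c through b
edges = {("a", "b"): 1, ("b", "c"): 1, ("a", "c"): 5, ("c", "d"): 1}
print(betweenness(build_adjacency(edges)))
```

With distances derived from NPMI, nodes with high betweenness are those that bridge strongly associated groups of entities.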

In [None]:
paragraph_metrics.betweenness_centrality("distance_npmi", write=True, write_property="betweenness")

We can inspect the underlying graph object and observe the newly added properties:

In [None]:
paragraph_metrics.graph.vp.keys()

Now, we will export this backend-specific graph object into a `PGFrame`.

In [None]:
new_paragraph_network = paragraph_metrics.get_pgframe()

In [None]:
new_paragraph_network.nodes(raw_frame=True).sample(5)

In [None]:
print("Top 10 nodes by degree")
for n in new_paragraph_network.nodes(raw_frame=True).nlargest(10, columns=["degree"]).index:
    print("\t", n)

In [None]:
print("Top 10 nodes by PageRank")
for n in new_paragraph_network.nodes(raw_frame=True).nlargest(10, columns=["pagerank"]).index:
    print("\t", n)

In [None]:
print("Top 10 nodes by betweenness")
for n in new_paragraph_network.nodes(raw_frame=True).nlargest(10, columns=["betweenness"]).index:
    print("\t", n)

##### Compute multiple metrics in one go

Alternatively, we can compute all the metrics in one go. To do so, we need to specify edge attributes used for computing different metrics (if an empty list is specified as a weight list for a metric, computation of this metric is not performed). 

We select the paragraph-based network and re-compute some of the previously illustrated metrics as follows:

In [None]:
result_metrics = paragraph_metrics.compute_all_node_metrics(
    degree_weights=["frequency"],
    pagerank_weights=["frequency"],
    betweenness_weights=["distance_npmi"])

In [None]:
result_metrics

### Interactive web-based analysis of the co-occurrence network 

#### Loading the generated graph into the visualization app

First of all, we set a backend for the visualization app (currently two backends are supported: based on `NetworkX` and `graph-tool`, in this example we use the latter).

In [98]:
visualization_app.set_backend("graph_tool")

In [None]:
# Run the following to use NetworkX as the backend for the visualization app
# visualization_app.set_backend("networkx")

In [99]:
visualization_app.add_graph(
    "Paragraph-based graph", graphs["paragraph"],
    tree=trees["paragraph"], default_top_n=100)

visualization_app.set_current_graph("Paragraph-based graph")

In [100]:
# Collect NPMI weights and node statistics for the neighbors of a selected node
processor = visualization_app._graph_processor.from_graph_object(
    visualization_app._graphs["Paragraph-based graph"]["object"])
# processor.get_node("subcortical")

selected_node = "subcortical"
weights = {}
other_weights = {}
types = {}
for n in processor.neighbors(selected_node):
    other_weights[n] = {}
    weights[n] = processor.get_edge(selected_node, n)["npmi"]
    _node = processor.get_node(n)
    other_weights[n]["paragraph_frequency"] = _node["paragraph_frequency"]
    other_weights[n]["degree_frequency"] = _node["degree_frequency"]
    other_weights[n]["pagerank_frequency"] = _node["pagerank_frequency"]
    other_weights[n]["num_paper"] = len(_node["paper"])
    types[n] = _node["entity_type"] if "entity_type" in _node else None

# top_neighbors = top_n(weights, neighborlimit)
# print(weights)

#### Loading papers' meta-data into the app

We now load an additional dataset containing some meta-data on the papers where the entities analyzed in this notebook occur.

In [101]:
paper_metadata = download_from_nexus(
    uri=f"{nexus_endpoint}/resources/covid19-kg/data/_/8fc1e60c-1ebe-4173-82c0-9775a4917041",
    output_path=DATA_PATH, config_file_path=nexus_config_file,
    nexus_endpoint=nexus_endpoint, nexus_bucket="covid19-kg/data")
paper_data = pd.read_csv(f"{DATA_PATH}/{paper_metadata.distribution.name}")
paper_data = paper_data.set_index("id")
paper_data.head(3)

Downloading the file to '../data/Glucose_risk_3000_paper_meta_data.csv'


Unnamed: 0_level_0,title,authors,abstract,doi,url,journal,pmc_id,pubmed_id,publish_time
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
3,Surfactant protein-D and pulmonary host defense,"Crouch, Erika C",Surfactant protein-D (SP-D) participates in th...,10.1186/rr19,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5...,Respir Res,PMC59549,11667972.0,2000-08-25
56,CLINICAL VIGNETTES,,,10.1046/j.1525-1497.18.s1.20.x,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1...,J Gen Intern Med,PMC1494988,12753119.0,2003-04-01
58,Clinical Vignettes,,,10.1046/j.1525-1497.2001.0160s1023.x,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1...,J Gen Intern Med,PMC1495316,11357836.0,2001-04-01


We pass a callback for the lookup of paper meta-data to the visualization app using the `set_list_papers_callback` method.

In [102]:
def list_papers(paper_data, selected_papers, limit=200):
    # Look up meta-data rows by integer paper id and return at most `limit` records
    selected_paper_data = paper_data.loc[[int(p) for p in selected_papers]].head(limit)
    return selected_paper_data.to_dict("records")

visualization_app.set_list_papers_callback(lambda x: list_papers(paper_data, x))

The ontology linking process described above is noisy; therefore, we would like to keep the possibility of accessing the raw entities that were linked to particular ontology concepts. For this, we define the function `get_aggregated_entities` that retrieves such raw entities, and we pass it to the visualization app using the `set_aggregated_entities_callback` method.

In [103]:
def top_n(data_dict, n, smallest=False):
    """Return top `n` keys of the input dictionary by their value."""
    df = pd.DataFrame(dict(data_dict).items(), columns=["id", "value"])
    if smallest:
        df = df.nsmallest(n, columns=["value"])
    else:
        df = df.nlargest(n, columns=["value"])
    return list(df["id"])


def get_aggregated_entities(entity, n):
    if "aggregated_entities" in curated_occurrence_data.columns:
        aggregated = curated_occurrence_data.loc[entity]["aggregated_entities"]
    else:
        aggregated = [entity]
    if curation_input_table is not None:
        df = curation_input_table.set_index("entity")
        if entity in curated_occurrence_data.index:
            freqs = df.loc[aggregated]["paper_frequency"].to_dict()
        else:
            return {}
    else:
        df = data.copy()
        df["entity"] = data["entity"].apply(lambda x: x.lower())
        freqs = df[df["entity"].apply(lambda x: x.lower() in aggregated)].groupby("entity").aggregate(
            lambda x: len(x))["entity_type"].to_dict()
    if len(freqs) == 0:
        return {}
    return {e: freqs[e] for e in top_n(freqs, n)}

visualization_app.set_aggregated_entities_callback(
    lambda x: get_aggregated_entities(x, 10))

Finally, we create a dictionary `definitions` that will serve the visualization app as the lookup table for accessing the definitions of different ontology concepts.

In [104]:
definitions = ontology_linking[["concept", "definition"]].groupby(
    "concept").aggregate(lambda x: list(x)[0]).to_dict()["definition"]
visualization_app.set_entity_definitons(definitions)

#### Launching the visualization app

As before, the interactive graph visualization app can be launched in two modes: inline and external. Here we recommend the external mode for better user experience.

In [105]:
visualization_app.run(port=8081, mode="external")

Opening port number 8081 failed: Address 'http://127.0.0.1:8081' already in use.
    Try passing a different port to run_server.. Trying port number 8082 ...



The 'environ['werkzeug.server.shutdown']' function is deprecated and will be removed in Werkzeug 2.1.



Dash app running on http://127.0.0.1:8082/


Exception in thread Thread-1827:
Traceback (most recent call last):
  File "/Users/mfsy/anaconda3/envs/bluegraph/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/Users/mfsy/anaconda3/envs/bluegraph/lib/python3.7/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/mfsy/anaconda3/envs/bluegraph/lib/python3.7/site-packages/retrying.py", line 49, in wrapped_f
    return Retrying(*dargs, **dkw).call(f, *args, **kw)
  File "/Users/mfsy/anaconda3/envs/bluegraph/lib/python3.7/site-packages/retrying.py", line 212, in call
    raise attempt.get()
  File "/Users/mfsy/anaconda3/envs/bluegraph/lib/python3.7/site-packages/retrying.py", line 247, in get
    six.reraise(self.value[0], self.value[1], self.value[2])
  File "/Users/mfsy/anaconda3/envs/bluegraph/lib/python3.7/site-packages/six.py", line 703, in reraise
    raise value
  File "/Users/mfsy/anaconda3/envs/bluegraph/lib/python3.7/site-packages/retrying.py", line 200, i

invalid literal for int() with base 10: '6141b83cd0b2f02ff4f7bbc72a3a44e9'
