# Literature exploration by term co-occurrence analysis

In this example we illustrate how network analytics can be used for literature exploration. 

The input dataset contains occurrences of different terms in paragraphs of scientific articles previously extracted by means of a Named Entity Recognition (NER) model. This dataset is transformed into two co-occurrence networks representing paper- and paragraph-level co-occurrence relations between terms. The term relations in these networks are quantified using mutual-information-based scores (pointwise mutual information and its normalized version).

The networks are further analysed using classical tools from complex networks: we compute various centrality measures characterizing the importance of extracted terms, we detect term communities representing densely connected clusters of terms, and finally we illustrate how the algorithms for finding shortest paths and minimum spanning trees can be used to perform guided search in networks.

In [1]:
import networkx as nx
import pandas as pd
import numpy as np

In [2]:
from bluegraph.core import (PandasPGFrame,
                            pretty_print_paths,
                            pretty_print_tripaths,
                            graph_elements_from_paths)
from bluegraph.preprocess.generators import CooccurrenceGenerator

from bluegraph.backends.graph_tool import (GTMetricProcessor,
                                           GTPathFinder,
                                           GTGraphProcessor,
                                           GTCommunityDetector)
from bluegraph.backends.graph_tool import graph_tool_to_pgframe

from bluegraph.backends.networkx import NXCommunityDetector, NXPathFinder

from bluegraph.backends.stellargraph import StellarGraphNodeEmbedder

## I. Entity-occurrence property graph

In this section we will create a property graph whose nodes are papers and extracted named entities, and whose edges connect entities to the papers they occur in.

The input data is given by occurrences of different entities in specific paragraphs of scientific articles.

In [3]:
mentions = pd.read_csv("data/literature_NER_example.csv")

In [4]:
mentions.sample(5)

Unnamed: 0,entity,occurrence
1434,sitagliptin,184360:Conclusion:63
539,diabetes mellitus,197804:Discussion:37
34,acute respiratory distress syndrome,179426:Morbidity And Mortality In Diabetic Coh...
877,host cell,184360:Mechanisms Of Sars-Cov-2 Entry Into Hos...
491,death,129074:Abstract:1


Every paragraph is identified using the format `<paper_id>:<section_id>:<paragraph_id>`. From this data we will extract occurrences in distinct papers/paragraphs as follows:

In [5]:
# Extract unique paper/section/paragraph identifiers
mentions["paper"] = mentions["occurrence"].apply(
    lambda x: x.split(":")[0])

mentions = mentions.rename(columns={"occurrence": "paragraph"})
mentions.sample(5)

Unnamed: 0,entity,paragraph,paper
301,covid-19,197804:Discussion:48,197804
425,covid-19,211125:Study Design And Participants:9,211125
779,glucose,172581::5,172581
133,blood,184360:Gliptins ::: Therapeutic Potential Of T...,184360
357,covid-19,160564:Data Sources And Search Strategy ::: Ma...,160564


We first create an empty property graph object.

In [6]:
graph = PandasPGFrame()

Then we add nodes for unique entities and papers:

In [7]:
entity_nodes = mentions["entity"].unique()
graph.add_nodes(entity_nodes)
graph.add_node_types({n: "Entity" for n in entity_nodes})

paper_nodes = mentions["paper"].unique()
graph.add_nodes(paper_nodes)
graph.add_node_types({n: "Paper" for n in paper_nodes})

In [8]:
graph.nodes(raw_frame=True)

Unnamed: 0_level_0,@type
@id,Unnamed: 1_level_1
ace inhibitor,Entity
acetaminophen,Entity
acute lung injury,Entity
acute respiratory distress syndrome,Entity
adenosine,Entity
...,...
78884,Paper
35198,Paper
139943,Paper
172581,Paper


We now add edges from entities to the papers they occur in, storing paragraphs as edge properties.

In [9]:
occurrence_edges = mentions.groupby(by=["entity", "paper"]).aggregate(set)

In [10]:
occurrence_edges

Unnamed: 0_level_0,Unnamed: 1_level_0,paragraph
entity,paper,Unnamed: 2_level_1
ace inhibitor,184360,{184360:Combined Therapeutic Potential Targeti...
ace inhibitor,197804,{197804:Management Of Children And Young Peopl...
acetaminophen,179426,{179426:Blood Glucose Monitoring ::: Special A...
acetaminophen,197804,{197804:Management Of Children And Young Peopl...
acute lung injury,179426,{179426:Role Of Ace/Arbs ::: Special Aspects O...
...,...,...
virus,184360,{184360:Sdpp4 As Soluble Decoy Factor ::: Ther...
virus,197804,"{197804:Discussion:44, 197804:Introduction:2}"
virus,211125,{211125:Discussion:25}
virus,211373,"{211373:Introduction:6, 211373:Introduction:5,..."


In [11]:
graph.add_edges(occurrence_edges.index)
graph.add_edge_types({e: "OccursIn" for e in occurrence_edges.index})

In [12]:
occurrence_edges.index = occurrence_edges.index.rename(["@source_id", "@target_id"])

In [13]:
graph.add_edge_properties(occurrence_edges["paragraph"])

In [14]:
graph.edges(raw_frame=True)

Unnamed: 0_level_0,Unnamed: 1_level_0,@type,paragraph
@source_id,@target_id,Unnamed: 2_level_1,Unnamed: 3_level_1
ace inhibitor,184360,OccursIn,{184360:Combined Therapeutic Potential Targeti...
ace inhibitor,197804,OccursIn,{197804:Management Of Children And Young Peopl...
acetaminophen,179426,OccursIn,{179426:Blood Glucose Monitoring ::: Special A...
acetaminophen,197804,OccursIn,{197804:Management Of Children And Young Peopl...
acute lung injury,179426,OccursIn,{179426:Role Of Ace/Arbs ::: Special Aspects O...
...,...,...,...
virus,184360,OccursIn,{184360:Sdpp4 As Soluble Decoy Factor ::: Ther...
virus,197804,OccursIn,"{197804:Discussion:44, 197804:Introduction:2}"
virus,211125,OccursIn,{211125:Discussion:25}
virus,211373,OccursIn,"{211373:Introduction:6, 211373:Introduction:5,..."


## II. Entity co-occurrence graphs

We will generate co-occurrence graphs for different occurrence factors (paper/paragraph), i.e. an edge between a pair of entities is added if they co-occur in the same paper or paragraph.

**NB: Some statistics computed during the co-occurrence analysis**

- `ppmi`: _positive pointwise mutual information (PPMI)_ is defined as $PPMI(x, y) = \log_2{\frac{p(x, y)}{p(x)p(y)}} $, if $p(x) \neq 0$ and $p(y) \neq 0$, and $PPMI(x, y) = 0$ otherwise.

- `npmi`: _normalized pointwise mutual information (NPMI)_ is defined as $NPMI(x, y) = \frac{PPMI(x, y)}{-\log_2{p(x, y)}} $.
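
The definitions above can be checked by hand. The sketch below assumes that probabilities are estimated as relative frequencies over the factors (papers or paragraphs); the function names `pmi` and `npmi` are illustrative and are not part of BlueGraph. The counts used (two terms that each occur in 2 of 20 papers and share 1 paper) are an assumption chosen to be consistent with the first row of the paper-level table shown further down.

```python
import math


def pmi(n_xy, n_x, n_y, n_total):
    """Pointwise mutual information (base 2) from raw counts.

    Assumes 0 < n_xy and that all counts refer to the same set of factors.
    """
    p_xy = n_xy / n_total
    p_x = n_x / n_total
    p_y = n_y / n_total
    return math.log2(p_xy / (p_x * p_y))


def npmi(n_xy, n_x, n_y, n_total):
    """PMI divided by -log2 p(x, y); assumes 0 < n_xy < n_total."""
    p_xy = n_xy / n_total
    return pmi(n_xy, n_x, n_y, n_total) / -math.log2(p_xy)


# Assumed counts: each term occurs in 2 of 20 papers, they co-occur in 1 paper
score = pmi(1, 2, 2, 20)
normalized = npmi(1, 2, 2, 20)
print(round(score, 6), round(normalized, 6))
```

With these counts the scores come out at roughly 2.32 and 0.54. Dividing by $-\log_2 p(x, y)$ bounds the normalized score by 1, which makes values comparable across rare and frequent term pairs.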

### Paper-based co-occurrence

We first generate a co-occurrence network from edges of type `OccursIn` linking entities and papers.

In [15]:
gen = CooccurrenceGenerator(graph)
paper_cooccurrence_edges = gen.generate_from_edges(
    "OccursIn", compute_statistics=["frequency", "ppmi", "npmi"])

Examining 12246 pairs of terms for co-occurrence...


In [16]:
paper_cooccurrence_edges["@type"] = "CoOccursWith"
paper_cooccurrence_edges

Unnamed: 0_level_0,Unnamed: 1_level_0,common_factors,frequency,ppmi,npmi,@type
@source_id,@target_id,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
ace inhibitor,acetaminophen,{197804},1,2.321928,0.537244,CoOccursWith
ace inhibitor,acute lung injury,"{197804, 184360}",2,2.321928,0.698970,CoOccursWith
ace inhibitor,acute respiratory distress syndrome,"{197804, 184360}",2,1.736966,0.522879,CoOccursWith
ace inhibitor,adenosine,{184360},1,2.321928,0.537244,CoOccursWith
ace inhibitor,adipose tissue,"{197804, 184360}",2,2.736966,0.823909,CoOccursWith
...,...,...,...,...,...,...
viral,viral infection,"{184360, 214924, 211125, 211373}",4,2.000000,0.861353,CoOccursWith
viral,virus,"{214924, 211125, 211373, 179426, 184360}",5,1.514573,0.757287,CoOccursWith
viral entry,viral infection,{214924},1,1.321928,0.305865,CoOccursWith
viral entry,virus,"{179426, 214924}",2,1.514573,0.455932,CoOccursWith


In [17]:
entity_nodes = graph.nodes_of_type("Entity").copy()

In [18]:
paper_frequency = mentions.groupby("entity").aggregate(set)["paper"].apply(len)
paper_frequency.name = "paper_frequency"

entity_nodes["paper_frequency"] = paper_frequency

We create a new property graph object from generated edges and entity nodes as follows:

In [19]:
paper_network = PandasPGFrame.from_frames(
    nodes=entity_nodes,
    edges=paper_cooccurrence_edges,
    node_prop_types={
        "paper_frequency": "numeric",
        "paragraph_frequency": "numeric"
    },
    edge_prop_types={
        "frequency": "numeric",
        "ppmi": "numeric",
        "npmi": "numeric"
    })

In [20]:
paper_network.edges(raw_frame=True).sample(5)

Unnamed: 0_level_0,Unnamed: 1_level_0,common_factors,frequency,ppmi,npmi,@type
@source_id,@target_id,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
acute respiratory distress syndrome,pneumonia,"{214924, 211125, 179426, 184360, 197804}",5,1.058894,0.529447,CoOccursWith
hypertension,insulin resistance,"{214924, 197804, 184360}",3,1.736966,0.634632,CoOccursWith
infectious disorder,insulin resistance,"{214924, 197804, 184360}",3,1.152003,0.420905,CoOccursWith
angiotensin-1,tumor necrosis factor,"{214924, 184360}",2,2.736966,0.823909,CoOccursWith
acetaminophen,pneumonia,"{179426, 197804}",2,1.321928,0.39794,CoOccursWith


In [21]:
paper_network.nodes(raw_frame=True).sample(5)

Unnamed: 0_level_0,@type,paper_frequency
@id,Unnamed: 1_level_1,Unnamed: 2_level_1
hcp,Entity,2
transmembrane protein,Entity,2
lung,Entity,6
proliferation,Entity,3
glyburide,Entity,9


### Paragraph-based co-occurrence

We perform a similar operation for paragraph-level co-occurrence. In order to use another co-occurrence factor, we define the following `factor_aggregator` function (`aggregate_paragraphs`) that takes a collection of sets of paragraphs and merges them into a single set. This aggregator will be used to collect sets of common paragraphs of `OccursIn` edges pointing from a pair of entities to the same paper.

In [22]:
def aggregate_paragraphs(data):
    return set(sum(data["paragraph"].apply(list), []))

In [23]:
%%time
paragraph_cooccurrence_edges = gen.generate_from_edges(
    "OccursIn",
    factor_aggregator=aggregate_paragraphs,
    compute_statistics=["frequency", "ppmi", "npmi"],
    parallelize=True, cores=8)

Computing total factor instances...
Examining 12246 pairs of terms for co-occurrence...
CPU times: user 87.7 ms, sys: 76.4 ms, total: 164 ms
Wall time: 5.43 s


In [24]:
paragraph_cooccurrence_edges["@type"] = "CoOccursWith"
paragraph_cooccurrence_edges.sample(5)

Unnamed: 0_level_0,Unnamed: 1_level_0,common_factors,frequency,ppmi,npmi,@type
@source_id,@target_id,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
lung,neutrophil,{214924:The Protective Role Of Angiotensin-Con...,2,1.736966,0.243638,CoOccursWith
cough,lower respiratory tract infection,"{214924:Introduction:6, 214924:Introduction:3}",2,4.544321,0.637416,CoOccursWith
cardiovascular system,sars-cov-2,"{214924:Introduction:6, 214924:Introduction:4,...",4,0.451563,0.073673,CoOccursWith
headache,respiratory failure,{214924:Introduction:6},1,6.129283,0.753976,CoOccursWith
adipose tissue,cellular secretion,{214924:Angiotensin-Converting Enzyme 2 Expres...,1,4.544321,0.559006,CoOccursWith


In [25]:
paragraph_network = PandasPGFrame.from_frames(
    nodes=entity_nodes,
    edges=paragraph_cooccurrence_edges,
    node_prop_types={
        "paper_frequency": "numeric",
    },
    edge_prop_types={
        "frequency": "numeric",
        "ppmi": "numeric",
        "npmi": "numeric"
    })

### Faster paragraph-based co-occurrence

Alternatively, to generate a paragraph-level co-occurrence network, we can assign the sets of paragraphs where entities occur as properties of their respective nodes, as follows.

In [26]:
paragraph_prop = pd.DataFrame({"paragraphs": mentions.groupby("entity").aggregate(set)["paragraph"]})
graph.add_node_properties(paragraph_prop, prop_type="category")
graph.nodes(raw_frame=True).sample(5)

Unnamed: 0_level_0,@type,paragraphs
@id,Unnamed: 1_level_1,Unnamed: 2_level_1
insulin resistance,Entity,{214924:The Interplay Between Covid-19 And Amp...
macrophage,Entity,{184360:Pathophysiology Of Covid-19: Pulmonar ...
78884,Paper,
interleukin-19,Entity,"{211125:Caption:31, 214924:The Immune Response..."
99569,Paper,


We then use the `generate_from_nodes` method of `CooccurrenceGenerator` to generate co-occurrence edges for nodes whose `paragraphs` properties have a non-empty intersection.

In [27]:
%%time
generator = CooccurrenceGenerator(graph)
paragraph_cooccurrence_edges = generator.generate_from_nodes(
    "paragraphs", total_factor_instances=len(mentions.paragraph.unique()),
    compute_statistics=["frequency", "npmi"],
    parallelize=True, cores=8)

Examining 15576 pairs of terms for co-occurrence...
CPU times: user 91.2 ms, sys: 71.2 ms, total: 162 ms
Wall time: 1.66 s
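
The idea behind node-based generation can be illustrated with plain Python sets. This is a minimal sketch on hypothetical toy data, not BlueGraph's implementation: every pair of terms is examined, and an edge is kept only when the two paragraph sets intersect.

```python
from itertools import combinations

# Hypothetical toy data: term -> set of paragraph identifiers
occurrences = {
    "glucose": {"p1:Intro:1", "p1:Intro:2", "p2:Results:4"},
    "insulin": {"p1:Intro:2", "p2:Results:4"},
    "lung": {"p3:Abstract:1"},
}

cooccurrence = {}
for a, b in combinations(sorted(occurrences), 2):
    common = occurrences[a] & occurrences[b]
    if common:  # keep only pairs sharing at least one paragraph
        cooccurrence[(a, b)] = {
            "common_factors": common,
            "frequency": len(common),
        }

print(cooccurrence)
```

The number of pairs to examine grows as $n(n-1)/2$ with the number of nodes $n$, which is why parallelizing this step can pay off on larger graphs.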


### Additional co-occurrence measures: NPMI-based distance

For both paper- and paragraph-based networks we will compute a mutual-information-based distance as follows:
$D = \frac{1}{NPMI}$.

In [28]:
import math

def compute_distance(x):
    return 1 / x if x > 0 else math.inf

In [29]:
npmi_distance = paper_network.edges(raw_frame=True)["npmi"].apply(compute_distance)
npmi_distance.name = "distance_npmi"
paper_network.add_edge_properties(npmi_distance, "numeric")

In [30]:
paper_network.edges(raw_frame=True).sample(5)

Unnamed: 0_level_0,Unnamed: 1_level_0,common_factors,frequency,ppmi,npmi,@type,distance_npmi
@source_id,@target_id,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
acetaminophen,coronaviridae,{197804},1,0.514573,0.119061,CoOccursWith,8.399054
cardiovascular system,hcp,{197804},1,1.0,0.231378,CoOccursWith,4.321928
influenza,lung,"{179426, 214924, 184360}",3,1.736966,0.634632,CoOccursWith,1.575717
infectious disorder,pneumonia,"{214924, 211125, 214728, 179426, 184360, 197804}",6,0.736966,0.424283,CoOccursWith,2.356915
pulmonary,rbd,"{214924, 184360}",2,2.321928,0.69897,CoOccursWith,1.430677


In [31]:
npmi_distance = paragraph_network.edges(raw_frame=True)["npmi"].apply(compute_distance)
npmi_distance.name = "distance_npmi"
paragraph_network.add_edge_properties(npmi_distance, "numeric")

In [32]:
paper_network.edges(raw_frame=True).sample(5)

Unnamed: 0_level_0,Unnamed: 1_level_0,common_factors,frequency,ppmi,npmi,@type,distance_npmi
@source_id,@target_id,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
acetaminophen,viral entry,{179426},1,2.321928,0.537244,CoOccursWith,1.861353
lower respiratory tract infection,proliferation,"{214924, 211125}",2,2.152003,0.647817,CoOccursWith,1.543645
survival,vascular,"{214924, 184360}",2,2.321928,0.69897,CoOccursWith,1.430677
acute lung injury,hmg-coa reductase inhibitor,"{179426, 214924}",2,2.321928,0.69897,CoOccursWith,1.430677
bradykinin,diabetic ketoacidosis,{179426},1,1.321928,0.305865,CoOccursWith,3.269412


## III. Nearest neighbors by co-occurrence scores

To illustrate the importance of computing mutual-information-based scores over raw frequencies, consider the following example, where we would like to find the closest (most related) neighbors of a specific term.

To do so, we will use the paragraph-based network and the raw co-occurrence frequency as the weight of our co-occurrence relation. The `top_neighbors` method of the `PathFinder` interface provided by BlueGraph allows us to search for top neighbors with the highest edge weight. In this example, we use the `graph_tool`-based `GTPathFinder` interface.

In [33]:
paragraph_path_finder = GTPathFinder(paragraph_network)

Observe in the following cell that the path finder interface generated a backend-specific graph object.

In [34]:
paragraph_path_finder.graph

<Graph object, directed, with 157 vertices and 2755 edges at 0x7ffc847319e8>

In [35]:
paragraph_path_finder.top_neighbors("glucose", 10, weight="frequency")

{'insulin': 11.0,
 'hyperglycemia': 8.0,
 'infectious disorder': 6.0,
 'inflammation': 6.0,
 'sars coronavirus': 6.0,
 'interleukin-6': 5.0,
 'sars-cov-2': 5.0,
 'glucose metabolism disorder': 4.0,
 'glyburide': 4.0,
 'oral cavity': 4.0}

In [36]:
paragraph_path_finder.top_neighbors("lung", 10, weight="frequency")

{'sars-cov-2': 13.0,
 'pulmonary': 10.0,
 'sars coronavirus': 10.0,
 'viral': 9.0,
 'mouse': 7.0,
 'virus': 5.0,
 'vascular': 4.0,
 'viral infection': 4.0,
 'middle east respiratory syndrome coronavirus': 3.0,
 'survival': 3.0}

We observe that 'glucose' and 'lung' share a lot of the closest neighbors by raw frequency. If we look at the list of the top 10 most frequent entities in the entire corpus (computed below from the `paper_frequency` property), we notice that 'glucose' and 'lung' co-occur the most with the terms that are simply the most frequent in our corpus, such as 'covid-19' and 'diabetes mellitus'.

(Closer inspection of the distribution of weighted node degrees suggests that the network contains _hubs_, nodes with significantly higher connectivity to other nodes.)

In [37]:
paragraph_network._nodes

Unnamed: 0_level_0,@type,paper_frequency
@id,Unnamed: 1_level_1,Unnamed: 2_level_1
ace inhibitor,Entity,2
acetaminophen,Entity,2
acute lung injury,Entity,4
acute respiratory distress syndrome,Entity,6
adenosine,Entity,2
...,...,...
vildagliptin,Entity,2
viral,Entity,5
viral entry,Entity,2
viral infection,Entity,4


In [38]:
paragraph_network.nodes(raw_frame=True).nlargest(10, columns=["paper_frequency"])

Unnamed: 0_level_0,@type,paper_frequency
@id,Unnamed: 1_level_1,Unnamed: 2_level_1
covid-19,Entity,20
diabetes mellitus,Entity,19
coronavirus,Entity,16
glucose,Entity,13
death,Entity,9
glyburide,Entity,9
infectious disorder,Entity,9
blood,Entity,8
hyperglycemia,Entity,8
pneumonia,Entity,8


To account for the presence of such hubs, we use the mutual-information-based scores presented above. They 'balance' the influence of the highly connected hub nodes such as 'covid-19' and 'diabetes mellitus' in our example.

In [39]:
paragraph_path_finder.top_neighbors("glucose", 10, weight="npmi")

{'glucose metabolism disorder': 0.43558951200762297,
 'insulin': 0.41957609533629175,
 'thrombophilia': 0.4079646453270325,
 'insulin infusion': 0.3744908698338857,
 'leukopenia': 0.3744908698338857,
 'millimole per liter': 0.3744908698338857,
 'troponin t, cardiac muscle': 0.3744908698338857,
 'hyperglycemia': 0.3255525953220345,
 'serum ferritin': 0.2953531691104843,
 'tissue': 0.2924401924613069}

In [40]:
paragraph_path_finder.top_neighbors("lung", 10, weight="npmi")

{'pulmonary': 0.6362945400636919,
 'receptor binding': 0.46595542454855043,
 'viral infection': 0.4465392749200213,
 'mouse': 0.4362945258726772,
 'viral': 0.4233482775367079,
 'sars coronavirus': 0.4042589954647716,
 'survival': 0.39499326084547054,
 'm protein': 0.2856252008999382,
 'myocardium': 0.2856252008999382,
 'neoplasm': 0.2856252008999382}

## IV. Graph metrics and centrality measures

BlueGraph provides the `MetricProcessor` interface for computing various graph statistics. As in the previous example, we will use the `graph_tool`-based `GTMetricProcessor` interface.

In [41]:
paper_metrics = GTMetricProcessor(paper_network, directed=False)
paragraph_metrics = GTMetricProcessor(paragraph_network, directed=False)

### Graph density

Density of a graph is quantified by the proportion of all possible edges ($n(n-1) / 2$ for the undirected graph with $n$ nodes) that are realized.

In [42]:
print("Density of the paper-based network: ", paper_metrics.density())
print("Density of the paragraph-based network: ", paragraph_metrics.density())

Density of the paper-based network:  0.7960150253143884
Density of the paragraph-based network:  0.22497141923893516


The results above show that in the paper- and paragraph-based networks, respectively, about 80% and 22% of all possible term pairs co-occur at least once.
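
As a quick cross-check, the density formula can be evaluated directly. The sketch below uses a hypothetical helper `density` and plugs in the 157 vertices and 2755 edges reported earlier for the paragraph-based graph object.

```python
def density(n_nodes, n_edges):
    """Fraction of the n(n-1)/2 possible undirected edges that are present."""
    possible_edges = n_nodes * (n_nodes - 1) / 2
    return n_edges / possible_edges


# Counts reported for the paragraph-based graph object above
print(density(157, 2755))
```

The result is approximately 0.225, in line with the value returned by `paragraph_metrics.density()`.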

### Node centrality (importance) measures

In this example we will compute the Degree and PageRank centralities only for the raw frequency, and the Betweenness centrality for the mutual-information-based scores. We will use methods provided by the `MetricProcessor` interface in the _write_ mode, i.e. computed metrics will be written as node properties of the underlying graph object.

_Degree centrality_ is given by the sum of weights of all incident edges of the given node and characterizes the importance of the node in the network in terms of its connectivity to other nodes (high degree = high connectivity).
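
A minimal sketch of this definition in plain Python, on a hypothetical toy edge list (the weights are illustrative only): each undirected edge adds its weight to the degree of both of its endpoints.

```python
from collections import defaultdict

# Hypothetical weighted, undirected edges: (source, target, frequency)
edges = [
    ("glucose", "insulin", 11),
    ("glucose", "hyperglycemia", 8),
    ("insulin", "hyperglycemia", 3),
]

degree = defaultdict(float)
for source, target, weight in edges:
    # An undirected edge contributes its weight to both endpoints
    degree[source] += weight
    degree[target] += weight

print(dict(degree))
```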

In [43]:
paragraph_metrics.degree_centrality("frequency", write=True, write_property="degree")

_PageRank centrality_ is another measure that estimates the importance of a given node in the network. Roughly speaking, it can be interpreted as the probability that, having landed on a random node in the network, we will jump to the given node (here the edge weights are taken into account).

https://en.wikipedia.org/wiki/PageRank
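
To make the random-walk interpretation concrete, here is a small power-iteration sketch of weighted PageRank in plain Python. It is an illustration under stated assumptions, not the backend's implementation: the damping factor of 0.85 and the fixed number of iterations are assumed values, and the toy star graph is hypothetical.

```python
def pagerank(weights, damping=0.85, iterations=100):
    """Weighted PageRank by power iteration.

    `weights` maps each node to a dict {neighbor: edge weight}; every node
    is assumed to have at least one neighbor.
    """
    nodes = list(weights)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    strength = {n: sum(weights[n].values()) for n in nodes}
    for _ in range(iterations):
        # Teleportation term shared by all nodes
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for u in nodes:
            for v, w in weights[u].items():
                # u passes its rank to v in proportion to the edge weight
                new_rank[v] += damping * rank[u] * w / strength[u]
        rank = new_rank
    return rank


# Hypothetical undirected star: "hub" is linked to three leaves
graph = {
    "hub": {"a": 1.0, "b": 1.0, "c": 1.0},
    "a": {"hub": 1.0},
    "b": {"hub": 1.0},
    "c": {"hub": 1.0},
}
ranks = pagerank(graph)
print(max(ranks, key=ranks.get))
```

The hub collects rank from all three leaves and therefore ends up with the highest score, which is the same effect that pushes very frequent terms to the top of the PageRank list in a co-occurrence network.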

In [44]:
paragraph_metrics.pagerank_centrality("frequency", write=True, write_property="pagerank")

We then compute the betweenness centrality based on the NPMI distances.

_Betweenness centrality_ is a node importance measure that estimates how often a shortest path between a pair of nodes will pass through the given node.

In [45]:
paragraph_metrics.betweenness_centrality("distance_npmi", write=True, write_property="betweenness")

We can inspect the underlying graph object and observe the newly added properties:

In [46]:
paragraph_metrics.graph.vp.keys()

['@id', '@type', 'paper_frequency', 'degree', 'pagerank', 'betweenness']

Now, we will export this backend-specific graph object into a `PGFrame`.

In [47]:
new_paragraph_network = paragraph_metrics.get_pgframe()

In [48]:
new_paragraph_network.nodes(raw_frame=True).sample(5)

Unnamed: 0_level_0,@type,paper_frequency,degree,pagerank,betweenness
@id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
"diarrhea, ctcae",Entity,3.0,38.0,0.004183,0.005707
fatigue,Entity,2.0,23.0,0.002879,0.002257
vascular,Entity,2.0,130.0,0.009778,0.002895
cardiac failure,Entity,2.0,18.0,0.002256,0.001554
angiotensin-2,Entity,2.0,19.0,0.002339,0.014006


In [49]:
print("Top 10 nodes by degree")
for n in new_paragraph_network.nodes(raw_frame=True).nlargest(10, columns=["degree"]).index:
    print("\t", n)

Top 10 nodes by degree
	 covid-19
	 diabetes mellitus
	 sars-cov-2
	 angiotensin-converting enzyme 2
	 dipeptidyl peptidase 4
	 lung
	 glucose
	 coronavirus
	 sars coronavirus
	 viral


In [50]:
print("Top 10 nodes by PageRank")
for n in new_paragraph_network.nodes(raw_frame=True).nlargest(10, columns=["pagerank"]).index:
    print("\t", n)

Top 10 nodes by PageRank
	 covid-19
	 diabetes mellitus
	 sars-cov-2
	 angiotensin-converting enzyme 2
	 dipeptidyl peptidase 4
	 glucose
	 lung
	 coronavirus
	 viral
	 sars coronavirus


In [51]:
print("Top 10 nodes by betweenness")
for n in new_paragraph_network.nodes(raw_frame=True).nlargest(10, columns=["betweenness"]).index:
    print("\t", n)

Top 10 nodes by betweenness
	 lymphopenia
	 pulmonary
	 glucose metabolism disorder
	 t-lymphocyte
	 cough
	 chemokine
	 d-dimer measurement
	 kidney
	 interleukin-19
	 ibuprofen


### Compute multiple metrics in one go

Alternatively, we can compute all the metrics in one go. To do so, we need to specify edge attributes used for computing different metrics (if an empty list is specified as a weight list for a metric, computation of this metric is not performed). 

We select the paragraph-based network and re-compute some of the previously illustrated metrics as follows:

In [52]:
result_metrics = paragraph_metrics.compute_all_node_metrics(
    degree_weights=["frequency"],
    pagerank_weights=["frequency"],
    betweenness_weights=["distance_npmi"])

In [53]:
result_metrics

{'degree': {'frequency': {'ace inhibitor': 47.0,
   'acetaminophen': 7.0,
   'acute lung injury': 133.0,
   'acute respiratory distress syndrome': 153.0,
   'adenosine': 24.0,
   'adipose tissue': 28.0,
   'angioedema': 23.0,
   'angiotensin ii receptor antagonist': 87.0,
   'angiotensin-1': 77.0,
   'angiotensin-2': 19.0,
   'angiotensin-converting enzyme': 13.0,
   'angiotensin-converting enzyme 2': 366.0,
   'animal': 53.0,
   'apoptosis': 20.0,
   'bals r': 6.0,
   'basal': 36.0,
   'blood': 174.0,
   'blood vessel': 30.0,
   'bradykinin': 18.0,
   'c-c motif chemokine 1': 48.0,
   'c-reactive protein': 35.0,
   'cardiac failure': 18.0,
   'cardiovascular disorder': 109.0,
   'cardiovascular system': 119.0,
   'cd44 antigen': 46.0,
   'cellular secretion': 49.0,
   'cerebrovascular': 19.0,
   'chemokine': 45.0,
   'chest pain': 23.0,
   'chloroquine': 8.0,
   'chronic disease': 12.0,
   'chronic kidney disease': 19.0,
   'comorbidity': 19.0,
   'confounding factors': 26.0,
   'coro

## V. Export network and the computed metrics

In [65]:
# Save graph as JSON
new_paragraph_network.export_json("data/literature_comention.json")

In [64]:
# Save the graph for Gephi import.
new_paragraph_network.export_to_gephi(
    "data/gephi_literature_comention", 
    node_attr_mapping = {
        "degree": "Degree",
        "pagerank": "PageRank",
        "betweenness": "Betweenness",
    },
    edge_attr_mapping={
        "npmi": "Weight"
    })

The representation of the network saved above can be imported into Gephi for producing graph visualizations, as in the following example:

In the figures below colors represent communities detected using the NPMI weight, node sizes are proportional to the PageRank of nodes and edge thickness to the NPMI values.

**Full network**
<img src="./figures/literature/full_network.png" alt="Literature co-occurrence network" style="width: 400px;"/>

**Community "Symptoms and comorbidities"**
<img src="./figures/literature/covid_19_comorbidities.png" alt="Drawing" style="width: 400px;"/>

**Community "Viral biology"**
<img src="./figures/literature/virus.png" alt="Drawing" style="width: 600px;"/>

**Community "Immunity"**
<img src="./figures/literature/immunity.png" alt="Drawing" style="width: 600px;"/>

## VI. Community detection

_Community detection_ methods partition the network into clusters of densely connected nodes in such a way that nodes in the same community are more densely connected to each other than to nodes in different communities. In this section we will illustrate the use of the `CommunityDetector` interface provided by BlueGraph for community detection and estimation of its quality using the modularity, performance and coverage metrics. The unified interface allows us to use various community detection methods available in different graph backends.

First, we create a `NetworkX`-based instance and use several different community detection strategies provided by this library. 

In [54]:
nx_detector = NXCommunityDetector(paragraph_network, directed=False)

In [55]:
nx_detector.graph

<networkx.classes.graph.Graph at 0x7ffc84610b00>

### Louvain algorithm

In [56]:
partition = nx_detector.detect_communities(
    strategy="louvain", weight="npmi")

In [57]:
print("Modularity: ", nx_detector.evaluate_parition(partition, metric="modularity"))
print("Performance: ", nx_detector.evaluate_parition(partition, metric="performance"))
print("Coverage: ", nx_detector.evaluate_parition(partition, metric="coverage"))

Modularity:  0.18724002885365992
Performance:  0.7733953944145027
Coverage:  0.37313974591651544


### Label propagation

In [58]:
partition = nx_detector.detect_communities(
    strategy="lpa", weight="frequency", intermediate=False)

In [59]:
print("Modularity: ", nx_detector.evaluate_parition(partition, metric="modularity"))
print("Performance: ", nx_detector.evaluate_parition(partition, metric="performance"))
print("Coverage: ", nx_detector.evaluate_parition(partition, metric="coverage"))

Modularity:  0.0
Performance:  0.22497141923893516
Coverage:  1.0
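
To help read such scores, the sketch below computes coverage and performance for an unweighted, undirected toy graph, assuming the usual definitions: coverage is the share of edges that fall inside communities, and performance is the share of node pairs that are either linked inside a community or unlinked across communities. The function name and the toy data are hypothetical.

```python
from itertools import combinations


def coverage_and_performance(nodes, edges, community):
    """Coverage and performance of a partition of an undirected graph.

    `community` maps every node to a community label.
    """
    edge_set = {frozenset(e) for e in edges}
    intra_edges = 0      # edges inside a community
    inter_non_edges = 0  # absent edges between different communities
    total_pairs = 0
    for u, v in combinations(nodes, 2):
        total_pairs += 1
        same = community[u] == community[v]
        linked = frozenset((u, v)) in edge_set
        if same and linked:
            intra_edges += 1
        elif not same and not linked:
            inter_non_edges += 1
    coverage = intra_edges / len(edge_set)
    performance = (intra_edges + inter_non_edges) / total_pairs
    return coverage, performance


# Hypothetical toy graph: 4 nodes, 3 of the 6 possible edges
nodes = ["a", "b", "c", "d"]
edges = [("a", "b"), ("b", "c"), ("c", "d")]

# All nodes in one community: coverage is 1 and performance
# equals the graph density (3 / 6 here)
single = {n: 0 for n in nodes}
print(coverage_and_performance(nodes, edges, single))
```

Under these definitions, a partition that places every node in a single community has coverage 1 and performance equal to the graph density. The label propagation scores above (coverage 1.0, modularity 0.0 and performance equal to the density of the paragraph-based network) are consistent with such a trivial partition.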


### Stochastic block model

In [60]:
gt_detector = GTCommunityDetector(paragraph_network, directed=False)

In [61]:
partition = gt_detector.detect_communities(strategy="sbm")

In [62]:
print("Modularity: ", nx_detector.evaluate_parition(partition, metric="modularity"))
print("Performance: ", nx_detector.evaluate_parition(partition, metric="performance"))
print("Coverage: ", nx_detector.evaluate_parition(partition, metric="coverage"))

Modularity:  0.07518940978455275
Performance:  0.7735587130491589
Coverage:  0.1647912885662432


## VII. Minimum spanning trees

A _spanning tree_ of a connected network is a subset of $n - 1$ edges that connects all $n$ nodes without forming cycles. A _minimum spanning tree_ is a spanning tree that, among all spanning trees of the network, has the smallest total edge weight.
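
For intuition, here is a compact sketch of Kruskal's algorithm, one classical way to build such a tree. It is not necessarily the algorithm used by the graph backend, and the terms and distances in the toy data are hypothetical.

```python
def minimum_spanning_tree(nodes, edges):
    """Kruskal's algorithm; `edges` is a list of (distance, u, v) tuples."""
    parent = {n: n for n in nodes}

    def find(n):
        # Follow parent links to the representative of n's component
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    tree = []
    for distance, u, v in sorted(edges):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:  # the edge joins two components, so no cycle
            parent[root_u] = root_v
            tree.append((u, v, distance))
    return tree


# Hypothetical distances between four terms
nodes = ["lung", "pulmonary", "viral", "virus"]
edges = [
    (1.6, "lung", "pulmonary"),
    (2.4, "lung", "viral"),
    (2.2, "pulmonary", "viral"),
    (1.3, "viral", "virus"),
    (3.0, "lung", "virus"),
]
tree = minimum_spanning_tree(nodes, edges)
print(tree)
```

The resulting tree keeps three of the five edges, always preferring the shortest edge that does not close a cycle.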

In the following example we compute a minimum spanning tree minimizing the NPMI-based distance weight of the network edges.

In [112]:
gt_paragraph_path_finder = GTPathFinder(new_paragraph_network)

In [113]:
tree = graph_tool_to_pgframe(gt_paragraph_path_finder.minimum_spanning_tree(distance="distance_npmi"))

In [114]:
tree.export_to_gephi(
    "data/gephi_literature_spanning_tree", 
    node_attr_mapping = {
        "degree": "Degree",
        "pagerank": "PageRank",
        "betweenness": "Betweenness",
#         "community_npmi": "Community"
    },
    edge_attr_mapping={
        "npmi": "Weight"
    })

The representation of the network saved above can be imported into Gephi for producing graph visualizations, as in the following example:

In the figures below colors represent communities detected using the NPMI weight, node sizes are proportional to the PageRank of nodes and edge thickness to the NPMI values.

**Full network**
<img src="./figures/literature/tree.png" alt="Literature co-occurrence MST" style="width: 400px;"/>

**Zoom into "covid-19"**
<img src="./figures/literature/tree_covid-19.png" alt="Drawing" style="width: 700px;"/>

## VIII. Shortest path search

The _shortest path search problem_ consists in finding a sequence of edges from the source node to the target node that minimizes the cumulative weight (or distance) associated with the edges.
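
A standard way to solve this problem for non-negative distances is Dijkstra's algorithm. The sketch below is a plain-Python illustration on a hypothetical three-node graph; it is not the backend's implementation.

```python
import heapq


def shortest_path(adjacency, source, target):
    """Dijkstra's algorithm on a graph with non-negative edge distances.

    `adjacency` maps each node to a dict {neighbor: distance}.
    Returns (total distance, list of nodes), or (inf, []) if unreachable.
    """
    queue = [(0.0, source, [source])]
    visited = set()
    while queue:
        dist, node, path = heapq.heappop(queue)
        if node == target:
            return dist, path
        if node in visited:
            continue
        visited.add(node)
        for neighbor, weight in adjacency[node].items():
            if neighbor not in visited:
                heapq.heappush(
                    queue, (dist + weight, neighbor, path + [neighbor]))
    return float("inf"), []


# Hypothetical distances: the direct edge a-c is longer than the detour via b
adjacency = {
    "a": {"b": 1.0, "c": 3.0},
    "b": {"a": 1.0, "c": 1.5},
    "c": {"a": 3.0, "b": 1.5},
}
print(shortest_path(adjacency, "a", "c"))
```

Here the weighted shortest path goes through `b` (total distance 2.5) even though a direct edge exists, which shows how the choice of edge weight changes which path counts as shortest.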

In [115]:
path = gt_paragraph_path_finder.shortest_path("lung", "sars-cov-2")
pretty_print_paths([path])

lung <->  <-> sars-cov-2
         


The cell above illustrates that the single shortest path from 'lung' to 'sars-cov-2' consists of the direct edge between them.

We adapt this problem to the literature exploration task, i.e. having fixed the source and the target concepts (the relation here is actually symmetric as the edges of our network are undirected), we would like to find a _set_ of $n$ shortest paths between them. Moreover, we would like these paths to be _indirect_ (not to include the direct edge from the source to the target). In the following examples we use mutual-information-based edge weights to perform our literature exploration. 

The library includes two strategies for finding such $n$ shortest paths. The first strategy uses Yen's algorithm for finding $n$ loopless shortest paths from the source to the target (https://en.wikipedia.org/wiki/Yen%27s_algorithm).


In [116]:
nx_paragraph_path_finder = NXPathFinder(new_paragraph_network)
paths = nx_paragraph_path_finder.n_shortest_paths(
    "lung", "sars-cov-2", n=10,
    distance="distance_npmi", strategy="yen")

In [117]:
pretty_print_paths(paths)

lung <->                                 <-> sars-cov-2
         
         acute lung injury
         pulmonary
         receptor binding
         human
         viral
         angiotensin-converting enzyme 2
         host cell
         dna replication
         acute lung injury <-> pulmonary


The second, _naive_, strategy is suitable for scenarios where our networks are large and highly dense (the performance of Yen's algorithm degrades as the number of edges approaches $O(N^2)$, with $N$ being the number of nodes).

This strategy simply finds _all_ the indirect shortest paths from the source to the target (in dense graphs the most common such paths are of length 2, i.e. `source <-> intermediary <-> target`, and therefore the number of such paths is roughly proportional to the number of nodes in the network). Then, the cumulative distance score is computed for every path and the top $n$ paths with the best score are selected.
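
The core of this idea can be sketched in a few lines for the length-2 case. The function below is a simplified, hypothetical illustration restricted to paths with a single intermediary; the library's own strategy may differ in detail. The adjacency data are made up.

```python
import heapq


def top_two_hop_paths(adjacency, source, target, n):
    """Rank paths source <-> m <-> target by their cumulative distance."""
    candidates = []
    for middle, first_leg in adjacency[source].items():
        if middle == target:
            continue  # skip the direct edge
        second_leg = adjacency[middle].get(target)
        if second_leg is not None:
            candidates.append(
                (first_leg + second_leg, (source, middle, target)))
    return [path for _, path in heapq.nsmallest(n, candidates)]


# Hypothetical symmetric distances
adjacency = {
    "lung": {"sars-cov-2": 2.0, "pulmonary": 1.5, "viral": 2.5, "mouse": 2.0},
    "pulmonary": {"lung": 1.5, "sars-cov-2": 3.0},
    "viral": {"lung": 2.5, "sars-cov-2": 1.0},
    "mouse": {"lung": 2.0},
    "sars-cov-2": {"lung": 2.0, "pulmonary": 3.0, "viral": 1.0},
}
paths = top_two_hop_paths(adjacency, "lung", "sars-cov-2", 2)
print(paths)
```

Each candidate requires only two dictionary lookups, so the cost grows with the number of neighbors of the source rather than with the number of edges in the whole graph.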

In [118]:
paths = gt_paragraph_path_finder.n_shortest_paths(
    "lung", "sars-cov-2", n=10,
    distance="distance_npmi", strategy="naive")

In [119]:
pretty_print_paths(paths)

lung <->                                 <-> sars-cov-2
         
         acute lung injury
         pulmonary
         receptor binding
         human
         viral
         angiotensin-converting enzyme 2
         host cell
         dna replication
         human kidney organoids


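We can also search for a path in the previously computed spanning tree. Since a tree contains exactly one path between any two nodes, this path can be much longer than the direct edge in the full network: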

In [120]:
tree_path_finder = GTPathFinder(tree)
path = tree_path_finder.shortest_path("lung", "sars-cov-2")
print(path)

('lung', 'acute lung injury', 'animal', 'mouse', 'human dpp4', 'human immunodeficiency virus', 'coronaviridae', 'sars-cov-2')
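
To see why the tree path can be long even though a direct edge exists in the full network, here is a small self-contained sketch: it builds a minimum spanning tree with Kruskal's algorithm and then recovers the unique tree path with a breadth-first search. The toy edges, distances and function names are hypothetical, and the tree used in the notebook may have been built differently.

```python
from collections import deque

# Hypothetical weighted undirected edges as (distance, u, v).
TOY_WEIGHTED_EDGES = [
    (1.0, "lung", "sars-cov-2"),
    (0.5, "lung", "pulmonary"),
    (0.25, "pulmonary", "human"),
    (0.75, "human", "sars-cov-2"),
    (2.0, "pulmonary", "sars-cov-2"),
]


def minimum_spanning_tree(edges):
    """Kruskal's algorithm with a simple union-find structure."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    tree = []
    for _, u, v in sorted(edges):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:  # the edge joins two components
            parent[root_u] = root_v
            tree.append((u, v))
    return tree


def tree_path(tree_edges, source, target):
    """Breadth-first search; in a tree the path found is the only one."""
    adjacency = {}
    for u, v in tree_edges:
        adjacency.setdefault(u, []).append(v)
        adjacency.setdefault(v, []).append(u)
    previous = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            break
        for neighbour in adjacency.get(node, []):
            if neighbour not in previous:
                previous[neighbour] = node
                queue.append(neighbour)
    if target not in previous:
        return None
    path = []
    node = target
    while node is not None:
        path.append(node)
        node = previous[node]
    return tuple(reversed(path))


toy_tree = minimum_spanning_tree(TOY_WEIGHTED_EDGES)
toy_tree_path = tree_path(toy_tree, "lung", "sars-cov-2")
print(toy_tree_path)
```

In this toy graph the direct edge between 'lung' and 'sars-cov-2' is left out of the tree because cheaper edges already connect the two nodes, so the tree path runs through two intermediaries.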


The library provides an additional utility for finding _tripaths_, i.e. paths of the shape `source <-> intermediary <-> target`. Setting the parameter `overlap` to `False` ensures that the sets of entities discovered on the paths `source <-> intermediary` and `intermediary <-> target` do not overlap.

In [121]:
path_a_b, path_b_c = gt_paragraph_path_finder.n_shortest_tripaths(
    "lung", "glucose", "sars-cov-2", 10,
    strategy="naive", distance="distance_npmi", overlap=False)

In [122]:
pretty_print_tripaths("lung", "glucose", "sars-cov-2", 10, path_a_b, path_b_c)

lung ->                                -> glucose ->                   -> sars-cov-2
        inflammation                                   cerebrovascular
        islet of langerhans                            plasmid
        serum                                          basal
        oral cavity                                    coronavirus
        death                                          headache
        middle east respiratory syndrome               comorbidity
        neutrophil                                     infectious disorder
        sulfonylurea antidiabetic agent                viral entry
        therapeutic corticosteroid                     sars coronavirus
        chemokine                                      person
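
The effect of disabling the overlap can be sketched on a toy graph. The example below assumes one plausible reading of the option: entities already used on the first leg are excluded when searching the second leg. The distances, function names and this exclusion rule are assumptions made for illustration, not a description of the library's internals.

```python
# Hypothetical distances on a toy undirected network.
TOY_TRIPATH_DISTANCES = {
    ("lung", "inflammation"): 0.5,
    ("inflammation", "glucose"): 0.5,
    ("lung", "serum"): 1.0,
    ("serum", "glucose"): 0.75,
    ("serum", "sars-cov-2"): 0.25,
    ("glucose", "plasmid"): 1.0,
    ("plasmid", "sars-cov-2"): 0.5,
    ("glucose", "headache"): 1.5,
    ("headache", "sars-cov-2"): 1.0,
    ("inflammation", "sars-cov-2"): 2.0,
}


def two_edge_paths(distances, source, target, n, excluded=()):
    """Best n paths source <-> m <-> target, skipping excluded entities."""
    adjacency = {}
    for (u, v), d in distances.items():
        adjacency.setdefault(u, {})[v] = d
        adjacency.setdefault(v, {})[u] = d
    common = set(adjacency[source]) & set(adjacency[target])
    common -= {source, target} | set(excluded)
    scored = sorted(
        (adjacency[source][m] + adjacency[target][m], m) for m in common)
    return [(source, m, target) for _, m in scored[:n]]


def toy_tripaths(distances, a, b, c, n, overlap=True):
    """Search a <-> b first, then b <-> c, optionally without reusing entities."""
    path_a_b = two_edge_paths(distances, a, b, n)
    # Assumption: with overlap disabled, first-leg entities are off limits.
    used = set() if overlap else {path[1] for path in path_a_b}
    path_b_c = two_edge_paths(distances, b, c, n, excluded=used)
    return path_a_b, path_b_c


with_overlap = toy_tripaths(
    TOY_TRIPATH_DISTANCES, "lung", "glucose", "sars-cov-2", 2, overlap=True)
without_overlap = toy_tripaths(
    TOY_TRIPATH_DISTANCES, "lung", "glucose", "sars-cov-2", 2, overlap=False)
print(with_overlap[1])
print(without_overlap[1])
```

With the overlap allowed, 'serum' appears on both legs. With it disabled, the second leg falls back to the next best entities that were not used on the first leg.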


## Nested path search

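To explore the space of co-occurring terms in depth, we can run the path search procedure presented above in a _nested fashion_. For each path through the entities $e_1, e_2, ..., e_k$ found from the source to the target, we can further expand every edge into a set of shortest paths between the corresponding pair of successive entities (i.e. paths between $e_1$ and $e_2$, $e_2$ and $e_3$, etc.). In the call below, `top_level_n` sets the number of paths found between the source and the target, `nested_n` the number of paths found for each expanded edge, and `depth` the number of nesting levels.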
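
The nesting can be sketched as an iterative expansion, assuming that only two-edge paths are considered and that a depth of 1 corresponds to the top-level search alone. The toy distances, the function names and these conventions are hypothetical and serve only to illustrate the idea.

```python
# Hypothetical toy network; distances are illustrative only.
TOY_NESTED_DISTANCES = {
    ("lung", "serum"): 1.0,
    ("serum", "glucose"): 1.0,
    ("lung", "inflammation"): 0.5,
    ("inflammation", "glucose"): 0.5,
    ("lung", "chemokine"): 0.5,
    ("chemokine", "inflammation"): 0.25,
    ("inflammation", "insulin"): 0.5,
    ("insulin", "glucose"): 0.25,
}


def build_adjacency(distances):
    """Turn an undirected edge dict into a symmetric adjacency dict."""
    adjacency = {}
    for (u, v), d in distances.items():
        adjacency.setdefault(u, {})[v] = d
        adjacency.setdefault(v, {})[u] = d
    return adjacency


def best_two_edge_paths(adjacency, source, target, n):
    """Best n indirect paths source <-> m <-> target by total distance."""
    common = set(adjacency[source]) & set(adjacency[target])
    common -= {source, target}
    ranked = sorted(
        common,
        key=lambda m: (adjacency[source][m] + adjacency[target][m], m))
    return [(source, m, target) for m in ranked[:n]]


def nested_paths(distances, source, target, top_level_n, nested_n, depth):
    """Expand every edge of every found path, `depth - 1` times."""
    adjacency = build_adjacency(distances)
    frontier = best_two_edge_paths(adjacency, source, target, top_level_n)
    collected = list(frontier)
    for _ in range(depth - 1):
        next_frontier = []
        for path in frontier:
            # Each pair of successive entities is one edge to expand.
            for u, v in zip(path, path[1:]):
                for nested in best_two_edge_paths(adjacency, u, v, nested_n):
                    if nested not in collected:
                        collected.append(nested)
                        next_frontier.append(nested)
        frontier = next_frontier
    return collected


toy_nested = nested_paths(
    TOY_NESTED_DISTANCES, "lung", "glucose",
    top_level_n=1, nested_n=1, depth=2)
for path in toy_nested:
    print(" <-> ".join(path))
```

Each nesting level replaces an edge with indirect paths between its endpoints, so the collected paths describe progressively finer context around the original source-target connection.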

In [123]:
paths = gt_paragraph_path_finder.n_nested_shortest_paths(
    "lung", "glucose", top_level_n=10, nested_n=2, depth=2, distance="distance_npmi",
    strategy="naive")

In [124]:
paths

[('inflammation', 'glucose metabolism disorder', 'glucose'),
 ('neutrophil', 'lymphocyte', 'glucose'),
 ('inflammation', 'thrombophilia', 'glucose'),
 ('lung', 'serum', 'inflammation'),
 ('lung', 'oral cavity', 'glucose'),
 ('chemokine', 'growth factor', 'glucose'),
 ('neutrophil', 'millimole per liter', 'glucose'),
 ('therapeutic corticosteroid', 'oral cavity', 'glucose'),
 ('islet of langerhans', 'hyperglycemia', 'glucose'),
 ('lung', 'chemokine', 'glucose'),
 ('sulfonylurea antidiabetic agent', 'blood', 'glucose'),
 ('lung', 'sulfonylurea antidiabetic agent', 'glucose'),
 ('lung', 'serum', 'glucose'),
 ('middle east respiratory syndrome', 'lymphocyte', 'glucose'),
 ('chemokine', 'glucose metabolism disorder', 'glucose'),
 ('therapeutic corticosteroid', 'insulin', 'glucose'),
 ('lung', 'inflammation'),
 ('serum', 'thrombophilia', 'glucose'),
 ('death', 'leukopenia', 'glucose'),
 ('lung', 'oral cavity'),
 ('middle east respiratory syndrome', 'glucose'),
 ('oral cavity', 'glucose'),
 (

We can now visualize the subnetwork constructed using the nodes and the edges discovered during our nested path search.

In [125]:
nodes, edges = graph_elements_from_paths(paths)

In [126]:
summary_graph = graph_tool_to_pgframe(
    gt_paragraph_path_finder.get_subgraph(
        nodes_to_exclude=[
            n for n in gt_paragraph_path_finder.get_nodes()
            if n not in nodes
        ],
        edges_to_exclude=[
            (s, t) for s, t in gt_paragraph_path_finder.get_edges()
            if (s, t) not in edges
        ]))

In [127]:
print("Number of nodes: ", summary_graph.number_of_nodes())
print("Number of edges: ", summary_graph.number_of_edges())

Number of nodes:  27
Number of edges:  23
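
When filtering edges of an undirected network by membership, edge orientation matters: if the same edge is stored as `(s, t)` on one side and as `(t, s)` on the other, a plain tuple test would silently drop it. Whether this happens with the library's `graph_elements_from_paths` and `get_edges` is not verified here. The sketch below only shows an orientation-insensitive way of collecting and matching edges, with hypothetical names and toy data.

```python
def elements_from_paths(paths):
    """Collect the nodes and the undirected edges visited by a set of paths."""
    nodes = set()
    edges = set()
    for path in paths:
        nodes.update(path)
        for u, v in zip(path, path[1:]):
            # A frozenset ignores the orientation of an undirected edge.
            edges.add(frozenset((u, v)))
    return nodes, edges


# Hypothetical paths and network edges, for illustration only.
toy_found_paths = [
    ("lung", "serum", "glucose"),
    ("glucose", "serum", "inflammation"),
    ("lung", "inflammation"),
]
toy_network_edges = [
    ("serum", "lung"),  # stored in the opposite orientation to the path
    ("serum", "glucose"),
    ("inflammation", "serum"),
    ("lung", "inflammation"),
    ("lung", "glucose"),  # not on any path
]

toy_nodes, toy_edges = elements_from_paths(toy_found_paths)
kept_edges = [
    (s, t) for s, t in toy_network_edges
    if frozenset((s, t)) in toy_edges
]
print(len(toy_nodes), len(kept_edges))
```

Here the edge between 'serum' and 'lung' is kept even though the network lists it in the opposite direction to the path that uses it.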


In [128]:
# Save the graph for Gephi import.
summary_graph.export_to_gephi(
    "data/gephi_literature_path_graph",
    node_attr_mapping={
        "degree": "Degree",
        "pagerank": "PageRank",
        "betweenness": "Betweenness",
        # "community_npmi": "Community"
    },
    edge_attr_mapping={
        "npmi": "Weight"
    })

The resulting graph, visualized with Gephi:
<img src="./figures/literature/path_graph.png" alt="Literature co-occurrence network" style="width: 800px;"/>