# Literature exploration by term co-occurrence analysis

In this example we illustrate how network analytics can be used for literature exploration. 

The input dataset contains occurrences of different terms in paragraphs of scientific articles, previously extracted by means of a Named Entity Recognition (NER) model. This dataset is transformed into two co-occurrence networks representing paper- and paragraph-level co-occurrence relations between terms. The term relations in these networks are quantified using mutual-information-based scores (pointwise mutual information and its normalized version).

The networks are further analysed using classical tools from complex networks: we compute various centrality measures characterizing the importance of extracted terms, we detect term communities representing densely connected clusters of terms, and finally we illustrate how algorithms for finding shortest paths and minimum spanning trees can be used to perform guided search in networks.

In [1]:
import networkx as nx
import pandas as pd
import numpy as np

In [2]:
from bluegraph.core import (PandasPGFrame,
                            pretty_print_paths,
                            pretty_print_tripaths,
                            graph_elements_from_paths)
from bluegraph.preprocess.generators import CooccurrenceGenerator

from bluegraph.backends.graph_tool import (GTMetricProcessor,
                                           GTPathFinder,
                                           GTGraphProcessor,
                                           GTCommunityDetector)
from bluegraph.backends.graph_tool import graph_tool_to_pgframe

from bluegraph.backends.networkx import NXCommunityDetector, NXPathFinder

from bluegraph.backends.stellargraph import StellarGraphNodeEmbedder

## I. Entity-occurrence property graph

In this section we will create a property graph whose nodes are papers and extracted named entities, and whose edges connect entities to the papers they occur in.

The input data is given by occurrences of different entities in specific paragraphs of scientific articles.

In [3]:
mentions = pd.read_csv("../data/literature_NER_example.csv")

In [4]:
mentions.sample(5)

Unnamed: 0,entity,occurrence
1033,insulin infusion,78018:Abstract:1
886,human,214924:Conclusion:28
351,covid-19,197804:Introduction:2
319,covid-19,179426:Role Of Dpp4 Enzyme And Dpp4 Inhibitors...
1488,vascular,184360:Combined Therapeutic Potential Targetin...


Every paragraph is identified using the format `<paper_id>:<section_id>:<paragraph_id>`. From this data we will extract occurrences in distinct papers/paragraphs as follows:

In [5]:
# Extract the paper identifier (the part before the first colon)
mentions["paper"] = mentions["occurrence"].apply(
    lambda x: x.split(":")[0])

mentions = mentions.rename(columns={"occurrence": "paragraph"})
mentions.sample(5)

Unnamed: 0,entity,paragraph,paper
791,glucose,35198:Abstract:1,35198
1498,vildagliptin,184360:Gliptins ::: Therapeutic Potential Of T...,184360
204,cellular secretion,184360:Gliptins ::: Therapeutic Potential Of T...,184360
522,diabetes mellitus,211125:Introduction:7,211125
1342,sars coronavirus,179426:Role Of Dpp4 Enzyme And Dpp4 Inhibitors...,179426
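
As a minimal sketch of how such identifiers can be taken apart, the following plain-Python helper splits an occurrence string into its three parts. The helper name `split_occurrence` is hypothetical, and the sketch assumes that the paragraph ID is always the last colon-separated field, since section titles may themselves contain colons (as in the `:::` separators above).

```python
def split_occurrence(occurrence):
    """Split '<paper_id>:<section_id>:<paragraph_id>' into its three parts."""
    # The paper ID is everything before the first colon.
    paper_id, _, rest = occurrence.partition(":")
    # Assumption: the paragraph ID is everything after the last colon; what
    # remains in between is the section ID, which may itself contain colons.
    section_id, _, paragraph_id = rest.rpartition(":")
    return paper_id, section_id, paragraph_id


print(split_occurrence("78018:Abstract:1"))
# Hypothetical identifier with a nested section title:
print(split_occurrence("1:Methods ::: Data:3"))
```

Only the paper ID is needed for the paper-level analysis below; the full occurrence string is kept as the paragraph identifier.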


We first create an empty property graph object.

In [6]:
graph = PandasPGFrame()

Then we add nodes for unique entities and papers:

In [7]:
entity_nodes = mentions["entity"].unique()
graph.add_nodes(entity_nodes)
graph.add_node_types({n: "Entity" for n in entity_nodes})

paper_nodes = mentions["paper"].unique()
graph.add_nodes(paper_nodes)
graph.add_node_types({n: "Paper" for n in paper_nodes})

In [8]:
graph.nodes(raw_frame=True)

Unnamed: 0_level_0,@type
@id,Unnamed: 1_level_1
ace inhibitor,Entity
acetaminophen,Entity
acute lung injury,Entity
acute respiratory distress syndrome,Entity
adenosine,Entity
...,...
78884,Paper
35198,Paper
139943,Paper
172581,Paper


We now add edges from entities to the papers they occur in, storing paragraphs as edge properties.

In [9]:
occurrence_edges = mentions.groupby(by=["entity", "paper"]).aggregate(set)

In [10]:
occurrence_edges

Unnamed: 0_level_0,Unnamed: 1_level_0,paragraph
entity,paper,Unnamed: 2_level_1
ace inhibitor,184360,"{184360:Caption:68, 184360:Aceis And Arbs ::: ..."
ace inhibitor,197804,"{197804:Caption:71, 197804:Caption:72, 197804:..."
acetaminophen,179426,{179426:Blood Glucose Monitoring ::: Special A...
acetaminophen,197804,"{197804:Discussion:52, 197804:Discussion:51, 1..."
acute lung injury,179426,{179426:Role Of Ace/Arbs ::: Special Aspects O...
...,...,...
virus,184360,{184360:Sdpp4 As Soluble Decoy Factor ::: Ther...
virus,197804,"{197804:Introduction:2, 197804:Discussion:44}"
virus,211125,{211125:Discussion:25}
virus,211373,"{211373:Introduction:3, 211373:Introduction:5,..."


In [11]:
graph.add_edges(occurrence_edges.index)
graph.add_edge_types({e: "OccursIn" for e in occurrence_edges.index})

In [12]:
occurrence_edges.index = occurrence_edges.index.rename(["@source_id", "@target_id"])

In [13]:
graph.add_edge_properties(occurrence_edges["paragraph"])

In [14]:
graph.edges(raw_frame=True)

Unnamed: 0_level_0,Unnamed: 1_level_0,@type,paragraph
@source_id,@target_id,Unnamed: 2_level_1,Unnamed: 3_level_1
ace inhibitor,184360,OccursIn,"{184360:Caption:68, 184360:Aceis And Arbs ::: ..."
ace inhibitor,197804,OccursIn,"{197804:Caption:71, 197804:Caption:72, 197804:..."
acetaminophen,179426,OccursIn,{179426:Blood Glucose Monitoring ::: Special A...
acetaminophen,197804,OccursIn,"{197804:Discussion:52, 197804:Discussion:51, 1..."
acute lung injury,179426,OccursIn,{179426:Role Of Ace/Arbs ::: Special Aspects O...
...,...,...,...
virus,184360,OccursIn,{184360:Sdpp4 As Soluble Decoy Factor ::: Ther...
virus,197804,OccursIn,"{197804:Introduction:2, 197804:Discussion:44}"
virus,211125,OccursIn,{211125:Discussion:25}
virus,211373,OccursIn,"{211373:Introduction:3, 211373:Introduction:5,..."


## II. Entity co-occurrence graphs

We will generate co-occurrence graphs for different occurrence factors (paper/paragraph), i.e. an edge between a pair of entities is added if they co-occur in the same paper or paragraph.
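
To make the idea concrete, here is a minimal plain-Python sketch of co-occurrence edge generation. It is not the `CooccurrenceGenerator` implementation; the function name `cooccurrence_edges` and the toy data are hypothetical. It groups entities by factor (paper or paragraph) and records, for every pair of entities, the set of factors they share.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence_edges(occurrences):
    """Map each unordered entity pair to the set of factors they share.

    `occurrences` is an iterable of (entity, factor) pairs, where a factor
    is a paper or paragraph identifier.
    """
    entities_by_factor = defaultdict(set)
    for entity, factor in occurrences:
        entities_by_factor[factor].add(entity)

    common_factors = defaultdict(set)
    for factor, entities in entities_by_factor.items():
        # Every pair of entities sharing this factor co-occurs in it.
        for source, target in combinations(sorted(entities), 2):
            common_factors[(source, target)].add(factor)
    return dict(common_factors)


# Toy data (hypothetical, not taken from the dataset).
toy = [("glucose", "p1"), ("insulin", "p1"), ("lung", "p1"),
       ("glucose", "p2"), ("insulin", "p2")]
edges = cooccurrence_edges(toy)
for pair, factors in sorted(edges.items()):
    print(pair, sorted(factors), "frequency =", len(factors))
```

The size of each shared set plays the role of the raw co-occurrence frequency.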

**NB: Some statistics computed during the co-occurrence analysis**

- `ppmi`: _positive pointwise mutual information (PPMI)_ is defined as $PPMI(x, y) = \log_2{\frac{p(x, y)}{p(x)p(y)}}$, if $p(x) \neq 0$ and $p(y) \neq 0$, and $PPMI(x, y) = 0$ otherwise.

- `npmi`: _normalized pointwise mutual information (NPMI)_ is defined as $NPMI(x, y) = \frac{PPMI(x, y)}{-\log_2{p(x, y)}}$.
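
The two scores can be sketched from raw counts as follows. This is an illustration rather than the library's code: the function signatures are hypothetical, probabilities are estimated as counts divided by the total number of factor instances, and negative values are clipped to zero, which is the usual convention for _positive_ PMI and an assumption here. The example call assumes a corpus of 20 papers, a figure inferred from the scores reported in this example rather than stated in the text.

```python
import math


def ppmi(joint_count, count_x, count_y, total):
    """Positive PMI from raw counts; 0 if a term or the pair never occurs."""
    if count_x == 0 or count_y == 0 or joint_count == 0:
        return 0.0
    p_xy = joint_count / total
    p_x, p_y = count_x / total, count_y / total
    # Assumption: negative values are clipped to zero ("positive" PMI).
    return max(0.0, math.log2(p_xy / (p_x * p_y)))


def npmi(joint_count, count_x, count_y, total):
    """PPMI divided by -log2 p(x, y)."""
    if joint_count == 0 or joint_count == total:
        # The normalizer is infinite or zero; return 0 in this sketch.
        return 0.0
    return ppmi(joint_count, count_x, count_y, total) / -math.log2(
        joint_count / total)


# Two terms that each occur in 2 papers and co-occur in 1 of them,
# in an assumed corpus of 20 papers.
print(round(ppmi(1, 2, 2, 20), 6), round(npmi(1, 2, 2, 20), 6))
```

With these inputs the sketch yields values consistent with the `ppmi` and `npmi` columns reported for such a pair in the paper-based table.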

### Paper-based co-occurrence

We first generate a co-occurrence network from edges of type `OccursIn` linking entities and papers.

In [15]:
gen = CooccurrenceGenerator(graph)
paper_cooccurrence_edges = gen.generate_from_edges(
    "OccursIn", compute_statistics=["frequency", "ppmi", "npmi"])

In [16]:
paper_cooccurrence_edges["@type"] = "CoOccursWith"
paper_cooccurrence_edges

Unnamed: 0_level_0,Unnamed: 1_level_0,common_factors,frequency,ppmi,npmi,@type
@source_id,@target_id,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
ace inhibitor,acetaminophen,{197804},1,2.321928,0.537244,CoOccursWith
ace inhibitor,acute lung injury,"{197804, 184360}",2,2.321928,0.698970,CoOccursWith
ace inhibitor,acute respiratory distress syndrome,"{197804, 184360}",2,1.736966,0.522879,CoOccursWith
ace inhibitor,adenosine,{184360},1,2.321928,0.537244,CoOccursWith
ace inhibitor,adipose tissue,"{197804, 184360}",2,2.736966,0.823909,CoOccursWith
...,...,...,...,...,...,...
viral,viral infection,"{214924, 184360, 211125, 211373}",4,2.000000,0.861353,CoOccursWith
viral,virus,"{214924, 184360, 179426, 211373, 211125}",5,1.514573,0.757287,CoOccursWith
viral entry,viral infection,{214924},1,1.321928,0.305865,CoOccursWith
viral entry,virus,"{214924, 179426}",2,1.514573,0.455932,CoOccursWith


From the generated edges we remove the ones with zero NPMI scores.

In [17]:
paper_cooccurrence_edges = paper_cooccurrence_edges[paper_cooccurrence_edges["npmi"] != 0]

In [18]:
entity_nodes = graph.nodes_of_type("Entity").copy()

In [19]:
paper_frequency = mentions.groupby("entity").aggregate(set)["paper"].apply(len)
paper_frequency.name = "paper_frequency"

entity_nodes["paper_frequency"] = paper_frequency

We create a new property graph object from generated edges and entity nodes as follows:

In [20]:
paper_network = PandasPGFrame.from_frames(
    nodes=entity_nodes,
    edges=paper_cooccurrence_edges,
    node_prop_types={
        "paper_frequency": "numeric",
        "paragraph_frequency": "numeric"
    },
    edge_prop_types={
        "frequency": "numeric",
        "ppmi": "numeric",
        "npmi": "numeric"
    })

In [21]:
paper_network.edges(raw_frame=True).sample(5)

Unnamed: 0_level_0,Unnamed: 1_level_0,common_factors,frequency,ppmi,npmi,@type
@source_id,@target_id,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
heart,severe acute respiratory syndrome,{184360},1,0.415037,0.096031,CoOccursWith
glyburide,hcp,{197804},1,0.152003,0.03517,CoOccursWith
molecule,person,{184360},1,1.321928,0.305865,CoOccursWith
immune cell,lung,"{184360, 211373}",2,1.736966,0.522879,CoOccursWith
acute respiratory distress syndrome,c-c motif chemokine 1,"{214924, 184360}",2,1.736966,0.522879,CoOccursWith


In [22]:
paper_network.nodes(raw_frame=True).sample(5)

Unnamed: 0_level_0,@type,paper_frequency
@id,Unnamed: 1_level_1,Unnamed: 2_level_1
dyspnea,Entity,2
cardiovascular disorder,Entity,4
m protein,Entity,2
adenosine,Entity,2
acetaminophen,Entity,2


### Paragraph-based co-occurrence

We perform a similar operation for paragraph-level co-occurrence. In order to use another co-occurrence factor, we define the following `factor_aggregator` function (`aggregate_paragraphs`), which takes a collection of sets of paragraphs and merges them into a single set. This aggregator will be used to collect the sets of common paragraphs of `OccursIn` edges pointing from a pair of entities to the same paper.

In [23]:
def aggregate_paragraphs(data):
    # Merge the paragraph sets of all grouped `OccursIn` edges into one set
    return set().union(*data["paragraph"])

In [24]:
%%time
paragraph_cooccurrence_edges = gen.generate_from_edges(
    "OccursIn",
    factor_aggregator=aggregate_paragraphs,
    compute_statistics=["frequency", "ppmi", "npmi"],
    parallelize=True, cores=8)

CPU times: user 73.5 ms, sys: 61.1 ms, total: 135 ms
Wall time: 3.93 s


In [25]:
paragraph_cooccurrence_edges["@type"] = "CoOccursWith"
paragraph_cooccurrence_edges.sample(5)

Unnamed: 0_level_0,Unnamed: 1_level_0,common_factors,frequency,ppmi,npmi,@type
@source_id,@target_id,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
hyperglycemia,sars-cov-2,"{211125:Discussion:25, 214924:Angiotensin-Conv...",2,0.0,0.0,CoOccursWith
cardiovascular disorder,sars-cov-2,{214924:Diabetes Mellitus: A Risk Factor For T...,4,0.567041,0.092513,CoOccursWith
glyburide,glycosylated hemoglobin measurement,{179426:Association Of Diabetes With Acute Vir...,1,2.020759,0.248578,CoOccursWith
cardiovascular disorder,cerebrovascular,{214924:Diabetes Mellitus: A Risk Factor For T...,1,3.544321,0.435994,CoOccursWith
hypertension,nephropathy,{197804:Management Of Children And Young Peopl...,1,2.669851,0.328424,CoOccursWith


From the generated edges we remove the ones with zero NPMI scores.

In [26]:
paragraph_cooccurrence_edges = paragraph_cooccurrence_edges[paragraph_cooccurrence_edges["npmi"] != 0]

In [27]:
paragraph_network = PandasPGFrame.from_frames(
    nodes=entity_nodes,
    edges=paragraph_cooccurrence_edges,
    node_prop_types={
        "paper_frequency": "numeric",
    },
    edge_prop_types={
        "frequency": "numeric",
        "ppmi": "numeric",
        "npmi": "numeric"
    })

### Faster paragraph-based co-occurrence

Alternatively, to generate a paragraph-level co-occurrence network, we can assign the sets of paragraphs in which entities occur as properties of their respective nodes, as follows.

In [28]:
paragraph_prop = pd.DataFrame({"paragraphs": mentions.groupby("entity").aggregate(set)["paragraph"]})
graph.add_node_properties(paragraph_prop, prop_type="category")
graph.nodes(raw_frame=True).sample(5)

Unnamed: 0_level_0,@type,paragraphs
@id,Unnamed: 1_level_1,Unnamed: 2_level_1
"diarrhea, ctcae",Entity,{211125:General Characteristics Of Patients Wi...
197804,Paper,
cough,Entity,"{197804:Introduction:2, 214924:Introduction:6,..."
nephropathy,Entity,{214924:Angiotensin-Converting Enzyme 2 Expres...
cardiac failure,Entity,"{184360:Caption:71, 211125:Study Design And Pa..."


We then use the `generate_from_nodes` method of `CooccurrenceGenerator` to generate co-occurrence edges for nodes whose `paragraphs` properties have a non-empty intersection.

In [29]:
%%time
generator = CooccurrenceGenerator(graph)
paragraph_cooccurrence_edges = generator.generate_from_nodes(
    "paragraphs", total_factor_instances=len(mentions.paragraph.unique()),
    compute_statistics=["frequency", "npmi"],
    parallelize=True, cores=8)

CPU times: user 68 ms, sys: 56.4 ms, total: 124 ms
Wall time: 880 ms
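
The idea behind node-based generation can be sketched in plain Python: for every pair of nodes, intersect their sets of factors and keep the pair if the intersection is non-empty. The function name `cooccurrence_from_node_sets` and the paragraph identifiers below are hypothetical, and the sketch makes no claim about how the library parallelizes this work.

```python
from itertools import combinations


def cooccurrence_from_node_sets(factor_sets):
    """Return {(source, target): shared factors} for overlapping node sets."""
    edges = {}
    for source, target in combinations(sorted(factor_sets), 2):
        shared = factor_sets[source] & factor_sets[target]
        if shared:  # keep only pairs with a non-empty intersection
            edges[(source, target)] = shared
    return edges


# Toy node property (hypothetical paragraph identifiers).
paragraphs = {
    "cough": {"1:Introduction:2", "2:Introduction:6"},
    "fever": {"1:Introduction:2"},
    "kidney": {"3:Discussion:4"},
}
print(cooccurrence_from_node_sets(paragraphs))
```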


### Additional co-occurrence measures: NPMI-based distance

For both paper- and paragraph-based networks we will compute a mutual-information-based distance as follows:
$D = \frac{1}{NPMI}$.

In [30]:
import math

def compute_distance(x):
    return 1 / x if x > 0 else math.inf

In [31]:
npmi_distance = paper_network.edges(raw_frame=True)["npmi"].apply(compute_distance)
npmi_distance.name = "distance_npmi"
paper_network.add_edge_properties(npmi_distance, "numeric")

In [32]:
paper_network.edges(raw_frame=True).sample(5)

Unnamed: 0_level_0,Unnamed: 1_level_0,common_factors,frequency,ppmi,npmi,@type,distance_npmi
@source_id,@target_id,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
chemokine,hmg-coa reductase inhibitor,"{214924, 179426}",2,2.736966,0.823909,CoOccursWith,1.213727
pneumonia,transmembrane protein,"{184360, 179426}",2,1.321928,0.39794,CoOccursWith,2.512942
ace inhibitor,obesity,{184360},1,1.321928,0.305865,CoOccursWith,3.269412
interleukin 1 beta measurement,sulfonylurea antidiabetic agent,{184360},1,2.321928,0.537244,CoOccursWith,1.861353
glucose metabolism disorder,nephropathy,"{197804, 214924}",2,2.321928,0.69897,CoOccursWith,1.430677


In [33]:
npmi_distance = paragraph_network.edges(raw_frame=True)["npmi"].apply(compute_distance)
npmi_distance.name = "distance_npmi"
paragraph_network.add_edge_properties(npmi_distance, "numeric")

In [34]:
paper_network.edges(raw_frame=True).sample(5)

Unnamed: 0_level_0,Unnamed: 1_level_0,common_factors,frequency,ppmi,npmi,@type,distance_npmi
@source_id,@target_id,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
hypoxia,tissue,"{214924, 184360}",2,2.736966,0.823909,CoOccursWith,1.213727
islet of langerhans,proliferation,{211125},1,1.736966,0.401896,CoOccursWith,2.488206
chronic disease,middle east respiratory syndrome,{179426},1,1.321928,0.305865,CoOccursWith,3.269412
saxagliptin,tissue,{184360},1,1.736966,0.401896,CoOccursWith,2.488206
heart,hmg-coa reductase inhibitor,{214924},1,1.0,0.231378,CoOccursWith,4.321928


## III. Nearest neighbors by co-occurrence scores

To illustrate the importance of computing mutual-information-based scores over raw frequencies, consider the following example, where we would like to find the closest (most related) neighbors of a specific term.

To do so, we will use the paragraph-based network and the raw co-occurrence frequency as the weight of our co-occurrence relation. The `top_neighbors` method of the `PathFinder` interface provided by BlueGraph allows us to search for the neighbors with the highest edge weight. In this example, we use the `graph_tool`-based `GTPathFinder` interface.
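
Conceptually, a top-neighbors query ranks the neighbors of a node by a chosen edge property. The sketch below shows this on a plain adjacency dictionary; the function is a hypothetical stand-in for the idea, not the `GTPathFinder` implementation, and the node names and weights are invented for illustration.

```python
import heapq


def top_neighbors(adjacency, node, n, weight):
    """Return the n neighbors of `node` with the largest `weight` value."""
    scores = {nbr: props[weight] for nbr, props in adjacency[node].items()}
    best = heapq.nlargest(n, scores.items(), key=lambda item: item[1])
    return dict(best)


# Hypothetical toy network: 'hub' is frequent everywhere, so it has a high
# raw frequency but a low NPMI score.
adjacency = {
    "term": {
        "hub": {"frequency": 30, "npmi": 0.1},
        "specific_1": {"frequency": 10, "npmi": 0.6},
        "specific_2": {"frequency": 5, "npmi": 0.4},
    }
}
print(top_neighbors(adjacency, "term", 2, weight="frequency"))
print(top_neighbors(adjacency, "term", 2, weight="npmi"))
```

In this toy network the choice of weight changes which neighbors come out on top.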

In [35]:
paragraph_path_finder = GTPathFinder(paragraph_network, directed=False)

Observe in the following cell that the path finder interface generated a backend-specific graph object.

In [36]:
paragraph_path_finder.graph

<Graph object, undirected, with 157 vertices and 2479 edges, 3 internal vertex properties, 6 internal edge properties, at 0x7f91201c7dd0>

In [37]:
paragraph_path_finder.top_neighbors("glucose", 10, weight="frequency")

{'diabetes mellitus': 29.0,
 'blood': 18.0,
 'insulin': 11.0,
 'death': 9.0,
 'hyperglycemia': 8.0,
 'coronavirus': 8.0,
 'infectious disorder': 6.0,
 'inflammation': 6.0,
 'sars coronavirus': 6.0,
 'interleukin-6': 5.0}

In [38]:
paragraph_path_finder.top_neighbors("lung", 10, weight="frequency")

{'covid-19': 24.0,
 'angiotensin-converting enzyme 2': 16.0,
 'sars-cov-2': 13.0,
 'acute lung injury': 12.0,
 'pulmonary': 10.0,
 'sars coronavirus': 10.0,
 'viral': 9.0,
 'human': 9.0,
 'mouse': 7.0,
 'inflammation': 6.0}

We observe that 'glucose' and 'lung' share some of their closest neighbors by raw frequency (for example, 'sars coronavirus' and 'inflammation'). If we look at the top 10 entities by frequency in the entire corpus (computed below from the paper frequency), we notice that 'glucose' and 'lung' co-occur the most with terms that are simply the most frequent in our corpus, such as 'covid-19' and 'diabetes mellitus'.

(Closer inspection of the distribution of weighted node degrees suggests that the network contains _hubs_, i.e. nodes with a very high degree of connectivity to other nodes.)

In [39]:
paragraph_network._nodes

Unnamed: 0_level_0,@type,paper_frequency
@id,Unnamed: 1_level_1,Unnamed: 2_level_1
ace inhibitor,Entity,2
acetaminophen,Entity,2
acute lung injury,Entity,4
acute respiratory distress syndrome,Entity,6
adenosine,Entity,2
...,...,...
vildagliptin,Entity,2
viral,Entity,5
viral entry,Entity,2
viral infection,Entity,4


In [40]:
paragraph_network.nodes(raw_frame=True).nlargest(10, columns=["paper_frequency"])

Unnamed: 0_level_0,@type,paper_frequency
@id,Unnamed: 1_level_1,Unnamed: 2_level_1
covid-19,Entity,20
diabetes mellitus,Entity,19
coronavirus,Entity,16
glucose,Entity,13
death,Entity,9
glyburide,Entity,9
infectious disorder,Entity,9
blood,Entity,8
hyperglycemia,Entity,8
pneumonia,Entity,8


To account for the presence of such hubs, we use the mutual-information-based scores presented above. They 'balance' the influence of the highly connected hub nodes such as 'covid-19' and 'diabetes mellitus' in our example.

In [41]:
paragraph_path_finder.top_neighbors("glucose", 10, weight="npmi")

{'blood': 0.5133209650995287,
 'glucose metabolism disorder': 0.43558951200762297,
 'insulin': 0.41957609533629175,
 'thrombophilia': 0.4079646453270325,
 'insulin infusion': 0.3744908698338857,
 'leukopenia': 0.3744908698338857,
 'millimole per liter': 0.3744908698338857,
 'troponin t, cardiac muscle': 0.3744908698338857,
 'bals r': 0.3744908698338857,
 'hyperglycemia': 0.3255525953220345}

In [42]:
paragraph_path_finder.top_neighbors("lung", 10, weight="npmi")

{'acute lung injury': 0.731006557092012,
 'pulmonary': 0.6362945400636919,
 'angiotensin-converting enzyme 2': 0.4757184436640079,
 'receptor binding': 0.46595542454855043,
 'viral infection': 0.4465392749200213,
 'animal': 0.4465392749200213,
 'mouse': 0.4362945258726772,
 'angiotensin-1': 0.4259996541516483,
 'viral': 0.4233482775367079,
 'human': 0.4098154380746763}

## IV. Graph metrics and centrality measures

BlueGraph provides the `MetricProcessor` interface for computing various graph statistics. As in the previous example, we will use the `graph_tool`-based `GTMetricProcessor` interface.

In [43]:
paper_metrics = GTMetricProcessor(paper_network, directed=False)
paragraph_metrics = GTMetricProcessor(paragraph_network, directed=False)

### Graph density

Density of a graph is quantified by the proportion of all possible edges ($n(n-1) / 2$ for the undirected graph with $n$ nodes) that are realized.

In [44]:
print("Density of the paper-based network: ", paper_metrics.density())
print("Density of the paragraph-based network: ", paragraph_metrics.density())

Density of the paper-based network:  0.7769884043769394
Density of the paragraph-based network:  0.20243344765637758


The results above show that in the paper- and paragraph-based networks, respectively, about 78% and 20% of all possible term pairs co-occur at least once.
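
The density formula is simple enough to check by hand. The following sketch (a hypothetical helper, not part of BlueGraph) applies it to the 157 vertices and 2479 edges reported earlier for the paragraph-based graph object.

```python
def density(n_nodes, n_edges):
    """Fraction of the n(n-1)/2 possible undirected edges that are present."""
    if n_nodes < 2:
        return 0.0  # no edges are possible
    possible_edges = n_nodes * (n_nodes - 1) / 2
    return n_edges / possible_edges


# Sizes reported above for the paragraph-based graph object.
print(density(157, 2479))
```

The result agrees with the density reported for the paragraph-based network.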

### Node centrality (importance) measures

In this example we will compute the Degree and PageRank centralities only for the raw frequency, and the Betweenness centrality for the mutual-information-based scores. We will use methods provided by the `MetricProcessor` interface in the _write_ mode, i.e. computed metrics will be written as node properties of the underlying graph object.

_Degree centrality_ is given by the sum of weights of all incident edges of the given node and characterizes the importance of the node in the network in terms of its connectivity to other nodes (high degree = high connectivity).
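
As a minimal sketch of this definition, the weighted degree can be computed from a list of weighted edges as follows. The function and the toy edge list are hypothetical and only illustrate the definition.

```python
def weighted_degree(edges):
    """Sum the weights of all edges incident to each node."""
    degree = {}
    for source, target, weight in edges:
        degree[source] = degree.get(source, 0) + weight
        degree[target] = degree.get(target, 0) + weight
    return degree


# Hypothetical undirected edges: (source, target, co-occurrence frequency).
toy_edges = [("a", "b", 3), ("a", "c", 1), ("b", "c", 2)]
print(weighted_degree(toy_edges))
```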

In [45]:
paragraph_metrics.degree_centrality("frequency", write=True, write_property="degree")

_PageRank centrality_ is another measure that estimates the importance of the given node in the network. Roughly speaking, it can be interpreted as the probability that, having landed on a random node in the network, we will jump to the given node (here the edge weights are taken into account).

https://en.wikipedia.org/wiki/PageRank
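
To give an intuition for the computation, here is a small power-iteration sketch of weighted PageRank on an undirected graph. It is an illustration under simplifying assumptions (every node has at least one edge with a positive weight, a fixed number of iterations, a damping factor of 0.85) and is not the routine used by the `graph_tool` backend.

```python
def pagerank(edges, damping=0.85, iterations=100):
    """Weighted PageRank on an undirected graph via power iteration."""
    # Build a symmetric weighted adjacency structure.
    neighbors = {}
    for source, target, weight in edges:
        neighbors.setdefault(source, {})[target] = weight
        neighbors.setdefault(target, {})[source] = weight

    nodes = list(neighbors)
    rank = {node: 1 / len(nodes) for node in nodes}
    strength = {node: sum(neighbors[node].values()) for node in nodes}

    for _ in range(iterations):
        new_rank = {}
        for node in nodes:
            # Each neighbor passes on a share of its rank proportional
            # to the weight of the connecting edge.
            incoming = sum(rank[nbr] * w / strength[nbr]
                           for nbr, w in neighbors[node].items())
            new_rank[node] = (1 - damping) / len(nodes) + damping * incoming
        rank = new_rank
    return rank


# Hypothetical star-shaped network: 'hub' is linked to three other terms.
ranks = pagerank([("hub", "a", 3), ("hub", "b", 1), ("hub", "c", 1)])
for node, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(node, round(score, 4))
```

In this toy network the hub receives the highest score, and the leaf attached by the heaviest edge ranks above the other leaves.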

In [46]:
paragraph_metrics.pagerank_centrality("frequency", write=True, write_property="pagerank")

We then compute the betweenness centrality based on the NPMI distances.

_Betweenness centrality_ is a node importance measure that estimates how often a shortest path between a pair of nodes will pass through the given node.

In [47]:
paragraph_metrics.betweenness_centrality("distance_npmi", write=True, write_property="betweenness")

We can inspect the underlying graph object and observe the newly added properties:

In [48]:
paragraph_metrics.graph.vp.keys()

<generator object PropertyDict.iterkeys at 0x7f91205ee750>

Now, we will export this backend-specific graph object into a `PGFrame`.

In [49]:
new_paragraph_network = paragraph_metrics.get_pgframe()

In [50]:
new_paragraph_network.nodes(raw_frame=True).sample(5)

Unnamed: 0_level_0,@type,paper_frequency,degree,pagerank,betweenness
@id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
metformin,Entity,2.0,39.0,0.004052,0.004425
blood,Entity,8.0,115.0,0.010854,0.010339
tumor necrosis factor,Entity,3.0,56.0,0.005504,0.016805
immune cell,Entity,2.0,27.0,0.003499,0.010008
plasma,Entity,3.0,57.0,0.00532,0.009967


In [51]:
print("Top 10 nodes by degree")
for n in new_paragraph_network.nodes(raw_frame=True).nlargest(10, columns=["degree"]).index:
    print("\t", n)

Top 10 nodes by degree
	 covid-19
	 diabetes mellitus
	 sars-cov-2
	 angiotensin-converting enzyme 2
	 lung
	 coronavirus
	 dipeptidyl peptidase 4
	 glucose
	 sars coronavirus
	 interleukin-6


In [52]:
print("Top 10 nodes by PageRank")
for n in new_paragraph_network.nodes(raw_frame=True).nlargest(10, columns=["pagerank"]).index:
    print("\t", n)

Top 10 nodes by PageRank
	 covid-19
	 diabetes mellitus
	 sars-cov-2
	 angiotensin-converting enzyme 2
	 lung
	 dipeptidyl peptidase 4
	 glucose
	 coronavirus
	 sars coronavirus
	 interleukin-6


In [53]:
print("Top 10 nodes by betweenness")
for n in new_paragraph_network.nodes(raw_frame=True).nlargest(10, columns=["betweenness"]).index:
    print("\t", n)

Top 10 nodes by betweenness
	 lymphopenia
	 pulmonary
	 glucose metabolism disorder
	 t-lymphocyte
	 cough
	 chemokine
	 d-dimer measurement
	 kidney
	 interleukin-19
	 ibuprofen


### Compute multiple metrics in one go

Alternatively, we can compute all the metrics in one go. To do so, we need to specify edge attributes used for computing different metrics (if an empty list is specified as a weight list for a metric, computation of this metric is not performed). 

We select the paragraph-based network and re-compute some of the previously illustrated metrics as follows:

In [54]:
result_metrics = paragraph_metrics.compute_all_node_metrics(
    degree_weights=["frequency"],
    pagerank_weights=["frequency"],
    betweenness_weights=["distance_npmi"])

In [55]:
result_metrics

{'degree': {'frequency': {'ace inhibitor': 39.0,
   'acetaminophen': 5.0,
   'acute lung injury': 128.0,
   'acute respiratory distress syndrome': 139.0,
   'adenosine': 22.0,
   'adipose tissue': 27.0,
   'angioedema': 20.0,
   'angiotensin ii receptor antagonist': 85.0,
   'angiotensin-1': 70.0,
   'angiotensin-2': 18.0,
   'angiotensin-converting enzyme': 13.0,
   'angiotensin-converting enzyme 2': 329.0,
   'animal': 47.0,
   'apoptosis': 18.0,
   'bals r': 6.0,
   'basal': 34.0,
   'blood': 115.0,
   'blood vessel': 28.0,
   'bradykinin': 18.0,
   'c-c motif chemokine 1': 47.0,
   'c-reactive protein': 31.0,
   'cardiac failure': 17.0,
   'cardiovascular disorder': 107.0,
   'cardiovascular system': 115.0,
   'cd44 antigen': 46.0,
   'cellular secretion': 47.0,
   'cerebrovascular': 18.0,
   'chemokine': 43.0,
   'chest pain': 21.0,
   'chloroquine': 7.0,
   'chronic disease': 12.0,
   'chronic kidney disease': 18.0,
   'comorbidity': 19.0,
   'confounding factors': 26.0,
   'coro

## V. Community detection

_Community detection_ methods partition the network into clusters of densely connected nodes, so that nodes in the same community are more strongly connected to each other than to nodes in other communities. In this section we illustrate the use of the `CommunityDetector` interface provided by BlueGraph for community detection and for estimating its quality using the modularity, performance and coverage metrics. The unified interface allows us to use various community detection methods available in different graph backends.
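
To clarify what two of these quality metrics measure, the sketch below computes weighted modularity and coverage for a given partition in plain Python. It is a simplified illustration, not the backend implementation: the function name is hypothetical, coverage is computed here on edge weights (an assumption; an unweighted variant counts edges instead), and the performance metric is omitted.

```python
def modularity_and_coverage(edges, community):
    """Weighted modularity and coverage of a node partition.

    `edges` holds (source, target, weight) triples of an undirected graph;
    `community` maps every node to a community label.
    """
    total = sum(weight for _, _, weight in edges)
    intra = {}   # weight of the edges inside each community
    degree = {}  # summed weighted degree of each community
    for source, target, weight in edges:
        c_source, c_target = community[source], community[target]
        degree[c_source] = degree.get(c_source, 0) + weight
        degree[c_target] = degree.get(c_target, 0) + weight
        if c_source == c_target:
            intra[c_source] = intra.get(c_source, 0) + weight

    # Modularity: intra-community weight minus its expected value.
    modularity = sum(
        intra.get(c, 0) / total - (degree[c] / (2 * total)) ** 2
        for c in degree)
    # Coverage: share of the edge weight that falls inside communities.
    coverage = sum(intra.values()) / total
    return modularity, coverage


# Hypothetical toy graph: two triangles joined by a single edge.
toy_edges = [("a", "b", 1), ("b", "c", 1), ("a", "c", 1),
             ("d", "e", 1), ("e", "f", 1), ("d", "f", 1),
             ("c", "d", 1)]
toy_partition = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
print(modularity_and_coverage(toy_edges, toy_partition))
```

Higher modularity means that more edge weight falls inside communities than would be expected from the node degrees alone.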

First, we create a `NetworkX`-based instance and use several different community detection strategies provided by this library. 

In [56]:
nx_detector = NXCommunityDetector(new_paragraph_network, directed=False)

In [57]:
nx_detector.graph

<networkx.classes.graph.Graph at 0x7f9120037310>

### Louvain algorithm

In [58]:
partition = nx_detector.detect_communities(
    strategy="louvain", weight="npmi")

In [59]:
print("Modularity: ", nx_detector.evaluate_parition(partition, metric="modularity", weight="npmi"))
print("Performance: ", nx_detector.evaluate_parition(partition, metric="performance", weight="npmi"))
print("Coverage: ", nx_detector.evaluate_parition(partition, metric="coverage", weight="npmi"))

Modularity:  0.33890209866564786
Performance:  0.789645598562796
Coverage:  0.3949173053650666


### Label propagation

In [60]:
partition = nx_detector.detect_communities(
    strategy="lpa", weight="npmi", intermediate=False)

In [61]:
print("Modularity: ", nx_detector.evaluate_parition(partition, metric="modularity", weight="npmi"))
print("Performance: ", nx_detector.evaluate_parition(partition, metric="performance", weight="npmi"))
print("Coverage: ", nx_detector.evaluate_parition(partition, metric="coverage", weight="npmi"))

Modularity:  0.07719091705395371
Performance:  0.3316184876694431
Coverage:  0.9415086728519564


### Stochastic block model

In [62]:
gt_detector = GTCommunityDetector(new_paragraph_network, directed=False)

In [63]:
partition = gt_detector.detect_communities(strategy="sbm", weight="npmi")

In [64]:
print("Modularity: ", nx_detector.evaluate_parition(partition, metric="modularity", weight="npmi"))
print("Performance: ", nx_detector.evaluate_parition(partition, metric="performance", weight="npmi"))
print("Coverage: ", nx_detector.evaluate_parition(partition, metric="coverage", weight="npmi"))

Modularity:  0.21501874092692383
Performance:  0.7382818879634166
Coverage:  0.24606696248487292


### Writing community partition as node properties

In [65]:
nx_detector.detect_communities(
    strategy="louvain", weight="npmi",
    write=True, write_property="louvain_community")

In [66]:
new_paragraph_network = nx_detector.get_pgframe(
    node_prop_types=new_paragraph_network._node_prop_types,
    edge_prop_types=new_paragraph_network._edge_prop_types)

In [67]:
new_paragraph_network.nodes(raw_frame=True).sample(5)

Unnamed: 0_level_0,@type,paper_frequency,degree,pagerank,betweenness,louvain_community
@id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
myocardium,Entity,2.0,17.0,0.002235,0.005852,2
nephropathy,Entity,2.0,12.0,0.001968,0.004963,2
plasmid,Entity,2.0,21.0,0.002723,0.016191,3
angiotensin-converting enzyme,Entity,3.0,13.0,0.002152,0.004591,1
diabetic ketoacidosis,Entity,4.0,46.0,0.004489,0.00397,0


## VI. Export network and the computed metrics

In [68]:
# Save graph as JSON
new_paragraph_network.export_json("../data/literature_comention.json")

In [69]:
# Save the graph for Gephi import.
new_paragraph_network.export_to_gephi(
    "../data/gephi_literature_comention", 
    node_attr_mapping = {
        "degree": "Degree",
        "pagerank": "PageRank",
        "betweenness": "Betweenness",
        "louvain_community": "Community"
    },
    edge_attr_mapping={
        "npmi": "Weight"
    })

The representation of the network saved above can be imported into Gephi for producing graph visualizations, as in the following example:

In the figures below colors represent communities detected using the Louvain algorithm (with NPMI edge weights), node sizes are proportional to the PageRank of nodes and edge thickness to the NPMI values.

**Full network**
<img src="./figures/literature/full_network.png" alt="Literature co-occurrence network" style="width: 400px;"/>

**Community "Symptoms and comorbidities"**
<img src="./figures/literature/covid_19_comorbidities.png" alt="Drawing" style="width: 400px;"/>

**Community "Viral biology"**
<img src="./figures/literature/virus.png" alt="Drawing" style="width: 600px;"/>

**Community "Immunity"**
<img src="./figures/literature/immunity.png" alt="Drawing" style="width: 600px;"/>

## VII. Minimum spanning trees

A _minimum spanning tree_ of a network is a subset of edges that keeps the network connected while using as few edges as possible ($n - 1$ edges connecting $n$ nodes). Its weighted version additionally minimizes the total edge weight.
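
One classical way to build such a tree is Kruskal's algorithm: sort the edges by weight and add each edge that joins two previously disconnected components. The sketch below is a plain-Python illustration on hypothetical toy data; it is not claimed to be the algorithm used by the `graph_tool` backend.

```python
def minimum_spanning_tree(nodes, edges):
    """Kruskal's algorithm; `edges` holds (source, target, distance) triples."""
    parent = {node: node for node in nodes}

    def find(node):
        # Follow parent links to the representative of the node's component.
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    tree = []
    for source, target, distance in sorted(edges, key=lambda e: e[2]):
        root_s, root_t = find(source), find(target)
        if root_s != root_t:  # the edge joins two components: keep it
            parent[root_s] = root_t
            tree.append((source, target, distance))
    return tree


# Hypothetical toy network with distance-like edge weights.
toy_nodes = ["a", "b", "c", "d"]
toy_edges = [("a", "b", 1.2), ("b", "c", 2.5), ("a", "c", 1.4),
             ("c", "d", 3.3), ("b", "d", 4.0)]
print(minimum_spanning_tree(toy_nodes, toy_edges))
```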

In the following example we compute a minimum spanning tree minimizing the NPMI-based distance weight of the network edges. We use the `graph_tool`-based implementation of the `PathFinder` interface.

In [70]:
gt_paragraph_path_finder = GTPathFinder(new_paragraph_network, directed=False)

In [71]:
gt_paragraph_path_finder.graph

<Graph object, undirected, with 157 vertices and 2479 edges, 7 internal vertex properties, 6 internal edge properties, at 0x7f912004b810>

In [72]:
tree = graph_tool_to_pgframe(gt_paragraph_path_finder.minimum_spanning_tree(distance="distance_npmi"))

In [73]:
tree.export_to_gephi(
    "../data/gephi_literature_spanning_tree", 
    node_attr_mapping = {
        "degree": "Degree",
        "pagerank": "PageRank",
        "betweenness": "Betweenness",
        "louvain_community": "Community"
    },
    edge_attr_mapping={
        "npmi": "Weight"
    })

The representation of the network saved above can be imported into Gephi for producing graph visualizations, as in the following example:

In the figures below colors represent communities detected using the NPMI weight, node sizes are proportional to the PageRank of nodes and edge thickness to the NPMI values.

**Full network**
<img src="./figures/literature/tree.png" alt="Literature co-occurrence MST" style="width: 400px;"/>

**Zoom into "covid-19"**
<img src="./figures/literature/tree_covid-19.png" alt="Drawing" style="width: 700px;"/>

## VIII. Shortest path search

The _shortest path search problem_ consists in finding a sequence of edges from the source node to the target node that minimizes the cumulative weight (or distance) associated with the edges.
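
For weighted graphs with non-negative distances, this problem is classically solved with Dijkstra's algorithm. The following is a minimal plain-Python sketch on a hypothetical toy graph; it illustrates the problem statement and is not the code behind the `PathFinder` interface.

```python
import heapq


def shortest_path(adjacency, source, target):
    """Dijkstra's algorithm on {node: {neighbor: distance}}; returns a node list."""
    best = {source: 0.0}
    previous = {}
    queue = [(0.0, source)]
    while queue:
        dist, node = heapq.heappop(queue)
        if node == target:
            break
        if dist > best.get(node, float("inf")):
            continue  # stale queue entry
        for nbr, weight in adjacency[node].items():
            candidate = dist + weight
            if candidate < best.get(nbr, float("inf")):
                best[nbr] = candidate
                previous[nbr] = node
                heapq.heappush(queue, (candidate, nbr))
    if target not in best:
        return None  # the target is unreachable
    # Walk back from the target to reconstruct the path.
    path = [target]
    while path[-1] != source:
        path.append(previous[path[-1]])
    return path[::-1]


# Hypothetical toy graph: the direct edge a-c is longer than the detour via b.
toy = {"a": {"b": 1.0, "c": 3.0},
       "b": {"a": 1.0, "c": 1.0},
       "c": {"a": 3.0, "b": 1.0},
       "d": {}}
print(shortest_path(toy, "a", "c"))
```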

In [74]:
path = gt_paragraph_path_finder.shortest_path("lung", "sars-cov-2")
pretty_print_paths([path])

lung <->  <-> sars-cov-2
         


The cell above illustrates that the single shortest path from 'lung' to 'sars-cov-2' consists of the direct edge between them.

We adapt this problem to the literature exploration task, i.e. having fixed the source and the target concepts (the relation here is actually symmetric as the edges of our network are undirected), we would like to find a _set_ of $n$ shortest paths between them. Moreover, we would like these paths to be _indirect_ (not to include the direct edge from the source to the target). In the following examples we use mutual-information-based edge weights to perform our literature exploration. 

The library includes two strategies for finding such $n$ shortest paths. The first strategy uses Yen's algorithm for finding $n$ loopless shortest paths from the source to the target (https://en.wikipedia.org/wiki/Yen%27s_algorithm).
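For readers who want to experiment with this idea outside the library, the sketch below uses `networkx.shortest_simple_paths`, which the NetworkX documentation describes as based on Yen's algorithm. Whether `NXPathFinder` calls this function internally is an assumption we do not verify here. The toy distances and the helper `n_indirect_shortest_paths` are hypothetical.

```python
from itertools import islice

import networkx as nx

# Toy undirected network; the "distance" values are made up for illustration.
G = nx.Graph()
G.add_weighted_edges_from(
    [
        ("lung", "sars-cov-2", 1.0),
        ("lung", "pulmonary", 1.2),
        ("pulmonary", "sars-cov-2", 1.1),
        ("lung", "human", 2.0),
        ("human", "sars-cov-2", 1.5),
        ("pulmonary", "human", 0.5),
    ],
    weight="distance",
)


def n_indirect_shortest_paths(graph, source, target, n, weight="distance"):
    """Return up to n loopless paths, cheapest first, skipping the direct edge."""
    candidates = nx.shortest_simple_paths(graph, source, target, weight=weight)
    # A two-node path is exactly the direct edge, so we drop it.
    indirect = (tuple(path) for path in candidates if len(path) > 2)
    return list(islice(indirect, n))


paths = n_indirect_shortest_paths(G, "lung", "sars-cov-2", n=3)
```

The generator yields paths lazily in order of increasing cumulative distance, so `islice` stops the enumeration as soon as `n` indirect paths have been collected.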


In [75]:
nx_paragraph_path_finder = NXPathFinder(new_paragraph_network, directed=False)
paths = nx_paragraph_path_finder.n_shortest_paths(
    "lung", "sars-cov-2", n=10,
    distance="distance_npmi", strategy="yen")

In [76]:
pretty_print_paths(paths)

lung <->                                 <-> sars-cov-2
         
         acute lung injury
         pulmonary
         receptor binding
         human
         viral
         angiotensin-converting enzyme 2
         host cell
         dna replication
         acute lung injury <-> pulmonary


The second, _naive_, strategy is suitable when the network is large and highly dense (the performance of Yen's algorithm degrades as the number of edges approaches $O(N^2)$, with $N$ being the number of nodes).

This strategy simply finds _all_ the indirect shortest paths from the source to the target (in dense graphs most of these paths have length 2, i.e. `source <-> intermediary <-> target`, and therefore the number of such paths is roughly proportional to the number of nodes in the network). Then, the cumulative distance score is computed for every path and the top $n$ paths with the best score are selected.
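The ranking step can be sketched in a few lines of plain Python. This is a simplification that only considers paths of length 2, not the library's implementation. The helper names and the distances are invented for illustration.

```python
from collections import defaultdict


def build_adjacency(weighted_edges):
    """Store an undirected weighted network as a symmetric dict of dicts."""
    adjacency = defaultdict(dict)
    for u, v, distance in weighted_edges:
        adjacency[u][v] = distance
        adjacency[v][u] = distance
    return adjacency


def naive_two_hop_paths(adjacency, source, target, n):
    """Rank every `source <-> intermediary <-> target` path by total distance."""
    shared = set(adjacency[source]) & set(adjacency[target])
    scored = sorted(
        (adjacency[source][mid] + adjacency[mid][target], mid) for mid in shared
    )
    return [(source, mid, target) for _, mid in scored[:n]]


# Toy data with made-up distances; "serum" is not linked to the target.
adjacency = build_adjacency([
    ("lung", "sars-cov-2", 1.0),
    ("lung", "pulmonary", 1.2), ("pulmonary", "sars-cov-2", 1.1),
    ("lung", "human", 2.0), ("human", "sars-cov-2", 1.5),
    ("lung", "viral", 0.9), ("viral", "sars-cov-2", 1.0),
    ("lung", "serum", 0.7),
])
paths = naive_two_hop_paths(adjacency, "lung", "sars-cov-2", n=2)
```

Each candidate costs one set intersection and one addition, which is why this approach stays cheap even when almost every pair of nodes is connected.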

In [77]:
paths = gt_paragraph_path_finder.n_shortest_paths(
    "lung", "sars-cov-2", n=10,
    distance="distance_npmi", strategy="naive")

In [78]:
pretty_print_paths(paths)

lung <->                                 <-> sars-cov-2
         
         acute lung injury
         pulmonary
         receptor binding
         human
         viral
         angiotensin-converting enzyme 2
         host cell
         dna replication
         human kidney organoids


The library provides an additional utility for finding _tripaths_, paths of the shape `source <-> intermediary <-> target`. Setting the parameter `overlap` to `False`, we can ensure that the sets of entities discovered on the paths `source <-> intermediary` and `intermediary <-> target` do not overlap.
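The effect of the `overlap` flag can be illustrated with a small stand-alone sketch. It rests on an assumption about the semantics: when `overlap` is `False`, the entities already used on the first leg are excluded from the search on the second leg. The helpers `two_hop_paths` and `tripaths` are hypothetical, limited to paths of length 2, and the distances are made up.

```python
def two_hop_paths(adjacency, source, target, n, excluded=frozenset()):
    """Top-n `source <-> mid <-> target` paths, ignoring excluded intermediaries."""
    candidates = (set(adjacency[source]) & set(adjacency[target])) - set(excluded)
    ranked = sorted(
        candidates,
        key=lambda mid: (adjacency[source][mid] + adjacency[mid][target], mid))
    return [(source, mid, target) for mid in ranked[:n]]


def tripaths(adjacency, a, b, c, n, overlap=True):
    """Paths from a to b and from b to c; optionally with disjoint entity sets."""
    paths_a_b = two_hop_paths(adjacency, a, b, n)
    # Assumed semantics: entities seen on the first leg are banned on the second.
    used = set() if overlap else {a} | {mid for _, mid, _ in paths_a_b}
    paths_b_c = two_hop_paths(adjacency, b, c, n, excluded=used)
    return paths_a_b, paths_b_c


# Symmetric toy network with made-up distances.
edges = [
    ("lung", "serum", 1.0), ("serum", "glucose", 1.0),
    ("lung", "inflammation", 0.5), ("inflammation", "glucose", 1.0),
    ("serum", "sars-cov-2", 0.5),
    ("glucose", "coronavirus", 1.0), ("coronavirus", "sars-cov-2", 1.0),
    ("glucose", "headache", 2.0), ("headache", "sars-cov-2", 1.5),
]
adjacency = {}
for u, v, distance in edges:
    adjacency.setdefault(u, {})[v] = distance
    adjacency.setdefault(v, {})[u] = distance

with_overlap = tripaths(adjacency, "lung", "glucose", "sars-cov-2", 2)
without_overlap = tripaths(
    adjacency, "lung", "glucose", "sars-cov-2", 2, overlap=False)
```

In this toy network `serum` is the best intermediary on the second leg, but it is dropped when `overlap=False` because it already appears on the first leg.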

In [79]:
("sars-cov-2", "glucose") in new_paragraph_network.edges()

False

In [80]:
gt_paragraph_path_finder.n_shortest_paths(
    "glucose", "sars-cov-2", n=10,
    distance="distance_npmi", strategy="naive")

[('glucose', 'cerebrovascular', 'sars-cov-2'),
 ('glucose', 'plasmid', 'sars-cov-2'),
 ('glucose', 'basal', 'sars-cov-2'),
 ('glucose', 'coronavirus', 'sars-cov-2'),
 ('glucose', 'headache', 'sars-cov-2'),
 ('glucose', 'islet of langerhans', 'sars-cov-2'),
 ('glucose', 'serum', 'sars-cov-2'),
 ('glucose', 'comorbidity', 'sars-cov-2'),
 ('glucose', 'infectious disorder', 'sars-cov-2'),
 ('glucose', 'viral entry', 'sars-cov-2')]

In [81]:
path_a_b, path_b_c = gt_paragraph_path_finder.n_shortest_tripaths(
    "lung", "glucose", "sars-cov-2", 10,
    strategy="naive", distance="distance_npmi", overlap=False)

In [82]:
pretty_print_tripaths("lung", "glucose", "sars-cov-2", 10, path_a_b, path_b_c)

lung ->                                -> glucose ->                   -> sars-cov-2
       
inflammation
                                 
cerebrovascular
       
islet of langerhans
                          
plasmid
       
serum
                                        
basal
       
oral cavity
                                  
coronavirus
       
death
                                        
headache
       
middle east respiratory syndrome
             
comorbidity
       
neutrophil
                                   
infectious disorder
       
chemokine
                                    
viral entry
       
sulfonylurea antidiabetic agent
              
sars coronavirus
       
therapeutic corticosteroid
                   
person


## IX. Nested path search

To explore the space of co-occurring terms in depth, we can run the path search procedure presented above in a _nested fashion_. For each path $e_1, e_2, \ldots, e_k$ found from the source to the target, we can further expand every edge on it into $n$ shortest paths between the corresponding pair of successive entities (i.e. paths between $e_1$ and $e_2$, $e_2$ and $e_3$, etc.).
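The nested expansion can be sketched as follows. This is a simplified illustration, not the library's procedure: it only uses paths of length 2, it assumes that `depth=1` means "top-level paths only", and all helper names and distances are invented.

```python
def two_hop_paths(adjacency, source, target, n):
    """Top-n `source <-> mid <-> target` paths ranked by total distance."""
    shared = set(adjacency[source]) & set(adjacency[target])
    ranked = sorted(
        shared,
        key=lambda mid: (adjacency[source][mid] + adjacency[mid][target], mid))
    return [(source, mid, target) for mid in ranked[:n]]


def n_nested_paths(adjacency, source, target, top_level_n, nested_n, depth):
    """Expand every edge of the paths found so far, `depth - 1` times."""
    found = two_hop_paths(adjacency, source, target, top_level_n)
    frontier = list(found)
    for _ in range(depth - 1):
        expanded = []
        for path in frontier:
            # Search again between each pair of successive entities.
            for u, v in zip(path, path[1:]):
                expanded.extend(two_hop_paths(adjacency, u, v, nested_n))
        # Keep only new paths, preserving their discovery order.
        frontier = [p for p in dict.fromkeys(expanded) if p not in found]
        found.extend(frontier)
    return found


# Symmetric toy network with made-up distances.
edges = [
    ("lung", "serum", 1.0), ("serum", "glucose", 1.0),
    ("lung", "blood", 0.5), ("blood", "serum", 0.5),
    ("serum", "insulin", 0.4), ("insulin", "glucose", 0.4),
]
adjacency = {}
for u, v, distance in edges:
    adjacency.setdefault(u, {})[v] = distance
    adjacency.setdefault(v, {})[u] = distance

paths = n_nested_paths(
    adjacency, "lung", "glucose", top_level_n=1, nested_n=1, depth=2)
nodes_on_paths = {node for path in paths for node in path}
```

The set `nodes_on_paths` collects every entity touched by the search, which is the kind of node set a summary subnetwork can be built from.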

In [83]:
paths = gt_paragraph_path_finder.n_nested_shortest_paths(
    "lung", "glucose", top_level_n=10, nested_n=2, depth=2, distance="distance_npmi",
    strategy="naive")

In [84]:
paths

[('lung', 'middle east respiratory syndrome'),
 ('neutrophil', 'lymphocyte', 'glucose'),
 ('lung', 'pulmonary', 'chemokine'),
 ('lung', 'serum'),
 ('lung', 'serum', 'inflammation'),
 ('lung', 'survival', 'oral cavity'),
 ('lung', 'islet of langerhans'),
 ('sulfonylurea antidiabetic agent', 'oral cavity', 'glucose'),
 ('lung', 'oral cavity', 'glucose'),
 ('lung', 'mouse', 'middle east respiratory syndrome'),
 ('oral cavity', 'blood', 'glucose'),
 ('neutrophil', 'millimole per liter', 'glucose'),
 ('sulfonylurea antidiabetic agent', 'blood', 'glucose'),
 ('lung', 'therapeutic corticosteroid', 'glucose'),
 ('lung', 'acute lung injury', 'chemokine'),
 ('lung', 'oral cavity'),
 ('chemokine', 'growth factor', 'glucose'),
 ('lung', 'serum', 'glucose'),
 ('lung', 'survival', 'sulfonylurea antidiabetic agent'),
 ('middle east respiratory syndrome', 'glucose'),
 ('serum', 'glucose metabolism disorder', 'glucose'),
 ('lung', 'acute lung injury', 'neutrophil'),
 ('lung', 'chemokine', 'glucose'),
 

We can now visualize the subnetwork constructed using the nodes and the edges discovered during our nested path search.

In [85]:
summary_graph = graph_tool_to_pgframe(
    gt_paragraph_path_finder.get_subgraph_from_paths(paths))

In [86]:
print("Number of nodes: ", summary_graph.number_of_nodes())
print("Number of edges: ", summary_graph.number_of_edges())

Number of nodes:  27
Number of edges:  63


In [87]:
# Save the graph for Gephi import.
summary_graph.export_to_gephi(
    "../data/gephi_literature_path_graph", 
    node_attr_mapping = {
        "degree": "Degree",
        "pagerank": "PageRank",
        "betweenness": "Betweenness",
        "louvain_community": "Community"
    },
    edge_attr_mapping={
        "npmi": "Weight"
    })

The resulting example graph visualized with Gephi:
<img src="./figures/literature/path_graph_example.png" alt="Literature co-occurrence network" style="width: 800px;"/>