# Literature exploration by term co-occurrence analysis

In this example we illustrate how network analytics can be used for literature exploration. 

The input dataset contains occurrences of different terms in paragraphs of scientific articles, previously extracted by means of a Named Entity Recognition (NER) model. This dataset is transformed into three co-occurrence networks representing paper-, section- and paragraph-level co-occurrence relations between terms. The term relations in these networks are quantified using mutual-information-based scores (pointwise mutual information and its normalized version).

The networks are further analysed using classical tools from complex networks: we compute various centrality measures characterizing the importance of the extracted terms, we detect term communities representing densely connected clusters of terms, and finally we illustrate how the algorithms for finding shortest paths and minimum spanning trees can be used to perform guided search in networks.

In [41]:
import networkx as nx
import pandas as pd
import numpy as np

In [42]:
from bluegraph.core import PandasPGFrame
from bluegraph.preprocess.generators import generate_cooccurrence_edges

from bluegraph.backends.graph_tool import GTMetricProcessor
from bluegraph.backends.networkx import NXMetricProcessor, NXPathFinder

## Prepare entity-occurrence data property graph

The input data is given by occurrences of different entities in specific paragraphs of scientific articles.

In [43]:
mentions = pd.read_csv("data/literature_NER_example.csv")

Every paragraph is identified using the format `<paper_id>:<section_id>:<paragraph_id>`. From this data we will extract occurrences in distinct papers/sections/paragraphs as follows:

In [44]:
# Extract unique paper/section/paragraph identifiers
mentions["paper"] = mentions["occurrence"].apply(
    lambda x: x.split(":")[0])
mentions["section"] = mentions["occurrence"].apply(
    lambda x: ":".join([x.split(":")[0], x.split(":")[1]]))

mentions = mentions.rename(columns={"occurrence": "paragraph"})
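As a quick check of the identifier handling above, the following sketch applies the same splitting logic to single identifiers. `split_occurrence` is a hypothetical helper, not part of the library, and the second identifier is made up for illustration. It suggests that a section title that itself contains `:` is cut at its first colon, which would explain the trailing spaces in some section identifiers in the tables below.

```python
def split_occurrence(occurrence):
    """Return (paper_id, section_id) for '<paper_id>:<section_id>:<paragraph_id>'."""
    parts = occurrence.split(":")
    paper = parts[0]
    section = ":".join(parts[:2])
    return paper, section


# Identifier of the regular shape
regular = split_occurrence("184360:Caption:68")
# Hypothetical identifier whose section title itself contains colons
truncated = split_occurrence("179426:Blood Glucose Monitoring ::: Special Aspects:12")

print(regular)
print(truncated)
```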

We first create an empty property graph.

In [45]:
graph = PandasPGFrame()

Then we add nodes for unique entities and papers

In [46]:
entity_nodes = mentions["entity"].unique()
graph.add_nodes(entity_nodes)
graph.add_node_types({n: "Entity" for n in entity_nodes})

paper_nodes = mentions["paper"].unique()
graph.add_nodes(paper_nodes)
graph.add_node_types({n: "Paper" for n in paper_nodes})

In [47]:
graph.nodes(raw_frame=True)

@id,@type
ace inhibitor,Entity
acetaminophen,Entity
acute lung injury,Entity
acute respiratory distress syndrome,Entity
adenosine,Entity
...,...
78884,Paper
35198,Paper
139943,Paper
172581,Paper


We now add edges from entities to the papers they occur in

In [49]:
occurrence_edges = mentions.groupby(by=["entity", "paper"]).aggregate(set)

In [50]:
occurrence_edges

entity,paper,paragraph,section
ace inhibitor,184360,"{184360:Caption:68, 184360:Conclusion:62, 1843...","{184360:Conclusion, 184360:Caption, 184360:Com..."
ace inhibitor,197804,{197804:Management Of Children And Young Peopl...,{197804:Management Of Children And Young Peopl...
acetaminophen,179426,{179426:Blood Glucose Monitoring ::: Special A...,{179426:Blood Glucose Monitoring }
acetaminophen,197804,{197804:Management Of Children And Young Peopl...,{197804:Management Of Children And Young Peopl...
acute lung injury,179426,{179426:Role Of Ace/Arbs ::: Special Aspects O...,{179426:Role Of Ace/Arbs }
...,...,...,...
virus,184360,{184360:Anti-Dpp4 Vaccine ::: Therapeutic Pote...,"{184360:Anti-Dpp4 Vaccine , 184360:Sdpp4 As So..."
virus,197804,"{197804:Discussion:44, 197804:Introduction:2}","{197804:Discussion, 197804:Introduction}"
virus,211125,{211125:Discussion:25},{211125:Discussion}
virus,211373,"{211373:Introduction:3, 211373:Introduction:5,...",{211373:Introduction}


In [51]:
graph.add_edges(occurrence_edges.index)
graph.add_edge_types({e: "OccursIn" for e in occurrence_edges.index})

In [52]:
occurrence_edges.index = occurrence_edges.index.rename(["@source_id", "@target_id"])

In [53]:
graph.add_edge_properties(occurrence_edges["paragraph"])
graph.add_edge_properties(occurrence_edges["section"])

In [54]:
graph.edges(raw_frame=True)

@source_id,@target_id,@type,paragraph,section
ace inhibitor,184360,OccursIn,"{184360:Caption:68, 184360:Conclusion:62, 1843...","{184360:Conclusion, 184360:Caption, 184360:Com..."
ace inhibitor,197804,OccursIn,{197804:Management Of Children And Young Peopl...,{197804:Management Of Children And Young Peopl...
acetaminophen,179426,OccursIn,{179426:Blood Glucose Monitoring ::: Special A...,{179426:Blood Glucose Monitoring }
acetaminophen,197804,OccursIn,{197804:Management Of Children And Young Peopl...,{197804:Management Of Children And Young Peopl...
acute lung injury,179426,OccursIn,{179426:Role Of Ace/Arbs ::: Special Aspects O...,{179426:Role Of Ace/Arbs }
...,...,...,...,...
virus,184360,OccursIn,{184360:Anti-Dpp4 Vaccine ::: Therapeutic Pote...,"{184360:Anti-Dpp4 Vaccine , 184360:Sdpp4 As So..."
virus,197804,OccursIn,"{197804:Discussion:44, 197804:Introduction:2}","{197804:Discussion, 197804:Introduction}"
virus,211125,OccursIn,{211125:Discussion:25},{211125:Discussion}
virus,211373,OccursIn,"{211373:Introduction:3, 211373:Introduction:5,...",{211373:Introduction}


## Generate co-occurrence networks

We will generate co-occurrence networks for different occurrence factors (paper/section/paragraph), i.e. an edge between a pair of entities is added if they co-occur in the same paper/section/paragraph.
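The idea can be sketched in plain Python, independently of the library. The function `cooccurrence_edges` and the toy occurrence data are illustrative assumptions; the sketch only mirrors the definition given here (an edge for every pair of entities sharing at least one factor), not the library's implementation.

```python
from itertools import combinations


def cooccurrence_edges(occurrences):
    """Map each co-occurring entity pair to the set of factors they share.

    `occurrences` maps an entity to the set of factors (papers, sections
    or paragraphs) in which it occurs.
    """
    edges = {}
    for source, target in combinations(sorted(occurrences), 2):
        common = occurrences[source] & occurrences[target]
        if common:  # add an edge only if the pair shares a factor
            edges[(source, target)] = common
    return edges


# Toy, made-up occurrence data
occurrences = {
    "glucose": {"p1", "p2", "p3"},
    "insulin": {"p2", "p3"},
    "lung": {"p4"},
}
edges = cooccurrence_edges(occurrences)
for pair, common in edges.items():
    print(pair, "frequency:", len(common))
```

The raw co-occurrence frequency of a pair is simply the number of shared factors.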

### Paper-based co-occurrence

In [55]:
paper_cooccurrence_edges = generate_cooccurrence_edges(
    graph, "OccursIn", compute_statistics=["frequency", "ppmi", "npmi"])

Examining 12246 pairs of terms for co-occurrence...


In [56]:
paper_cooccurrence_edges["@type"] = "CoOccursWith"
paper_cooccurrence_edges

@source_id,@target_id,common_factors,frequency,ppmi,npmi,@type
ace inhibitor,acetaminophen,{197804},1,2.321928,0.537244,CoOccursWith
ace inhibitor,acute lung injury,"{184360, 197804}",2,2.321928,0.698970,CoOccursWith
ace inhibitor,acute respiratory distress syndrome,"{184360, 197804}",2,1.736966,0.522879,CoOccursWith
ace inhibitor,adenosine,{184360},1,2.321928,0.537244,CoOccursWith
ace inhibitor,adipose tissue,"{184360, 197804}",2,2.736966,0.823909,CoOccursWith
...,...,...,...,...,...,...
viral,viral infection,"{214924, 184360, 211125, 211373}",4,2.000000,0.861353,CoOccursWith
viral,virus,"{184360, 179426, 211373, 214924, 211125}",5,1.514573,0.757287,CoOccursWith
viral entry,viral infection,{214924},1,1.321928,0.305865,CoOccursWith
viral entry,virus,"{214924, 179426}",2,1.514573,0.455932,CoOccursWith


In [16]:
entity_nodes = graph.nodes_of_type("Entity").copy()

In [17]:
paper_frequency = mentions.groupby("entity").aggregate(set)["paper"].apply(len)
paper_frequency.name = "paper_frequency"

section_frequency = mentions.groupby("entity").aggregate(set)["section"].apply(len)
section_frequency.name = "section_frequency"

paragraph_frequency = mentions.groupby("entity").aggregate(set)["paragraph"].apply(len)
paragraph_frequency.name = "paragraph_frequency"

entity_nodes["paper_frequency"] = paper_frequency
entity_nodes["section_frequency"] = section_frequency
entity_nodes["paragraph_frequency"] = paragraph_frequency

In [18]:
paper_network = PandasPGFrame.from_frames(
    nodes=entity_nodes,
    edges=paper_cooccurrence_edges,
    node_prop_types={
        "paper_frequency": "numeric",
        "section_frequency": "numeric",
        "paragraph_frequency": "numeric"
    },
    edge_prop_types={
        "frequency": "numeric",
        "ppmi": "numeric",
    })

In [19]:
paper_network.edges(raw_frame=True)

@source_id,@target_id,common_factors,frequency,ppmi,npmi,@type
ace inhibitor,acetaminophen,{197804},1,2.321928,0.537244,CoOccursWith
ace inhibitor,acute lung injury,"{184360, 197804}",2,2.321928,0.698970,CoOccursWith
ace inhibitor,acute respiratory distress syndrome,"{184360, 197804}",2,1.736966,0.522879,CoOccursWith
ace inhibitor,adenosine,{184360},1,2.321928,0.537244,CoOccursWith
ace inhibitor,adipose tissue,"{184360, 197804}",2,2.736966,0.823909,CoOccursWith
...,...,...,...,...,...,...
viral,viral infection,"{214924, 184360, 211125, 211373}",4,2.000000,0.861353,CoOccursWith
viral,virus,"{184360, 179426, 211373, 214924, 211125}",5,1.514573,0.757287,CoOccursWith
viral entry,viral infection,{214924},1,1.321928,0.305865,CoOccursWith
viral entry,virus,"{214924, 179426}",2,1.514573,0.455932,CoOccursWith


In [20]:
paper_network.nodes(raw_frame=True)

@id,@type,paper_frequency,section_frequency,paragraph_frequency
ace inhibitor,Entity,2,6,9
acetaminophen,Entity,2,3,4
acute lung injury,Entity,4,11,12
acute respiratory distress syndrome,Entity,6,11,17
adenosine,Entity,2,2,2
...,...,...,...,...
vildagliptin,Entity,2,2,6
viral,Entity,5,15,21
viral entry,Entity,2,5,5
viral infection,Entity,4,6,6


### Section-based co-occurrence

In [21]:
def aggregate_sections(data):
    return set(sum(data["section"].apply(list), []))
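The expression `sum(lists, [])` used above concatenates a sequence of lists, and wrapping the result in `set` removes duplicates. A plain-Python sketch of the same merge, using made-up section sets and a hypothetical helper `union_of_sets`:

```python
def union_of_sets(sets):
    """Merge an iterable of sets into a single set."""
    return set().union(*sets)


# Made-up section sets, e.g. one per paper in which a term occurs
per_paper_sections = [
    {"184360:Caption", "184360:Conclusion"},
    {"197804:Discussion"},
]
merged = union_of_sets(per_paper_sections)
# The sum-based idiom produces the same set
flattened = set(sum([list(s) for s in per_paper_sections], []))
print(sorted(merged))
```

Both forms give the same result; `set().union(...)` avoids building the intermediate lists.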

In [22]:
section_cooccurrence_edges = generate_cooccurrence_edges(
    graph, "OccursIn", 
    factor_aggregator=aggregate_sections,
    compute_statistics=["frequency", "ppmi", "npmi"]
)

Examining 12246 pairs of terms for co-occurrence...


In [23]:
section_cooccurrence_edges["@type"] = "CoOccursWith"
section_cooccurrence_edges

@source_id,@target_id,common_factors,frequency,ppmi,npmi,@type
ace inhibitor,acetaminophen,{197804:Management Of Children And Young Peopl...,1,2.557995,0.380206,CoOccursWith
ace inhibitor,acute lung injury,"{184360:Caption, 184360:Aceis And Arbs }",2,1.683526,0.293916,CoOccursWith
ace inhibitor,acute respiratory distress syndrome,{184360:Caption},1,0.683526,0.101595,CoOccursWith
ace inhibitor,angioedema,{184360:Combined Therapeutic Potential Targeti...,1,2.557995,0.380206,CoOccursWith
ace inhibitor,angiotensin ii receptor antagonist,"{184360:Caption, 184360:Combined Therapeutic P...",3,3.405992,0.662263,CoOccursWith
...,...,...,...,...,...,...
viral,viral entry,"{179426:Conclusion, 179426:Role Of Dpp4 Enzyme...",5,2.821030,0.640271,CoOccursWith
viral,viral infection,"{214924:The Immune Response To Sars-Cov-2 , 21...",6,2.821030,0.680922,CoOccursWith
viral,virus,"{214924:Abstract, 214924:The Immune Response T...",9,1.821030,0.511813,CoOccursWith
viral entry,virus,{179426:Role Of Dpp4 Enzyme And Dpp4 Inhibitor...,2,1.236067,0.215797,CoOccursWith


In [24]:
section_network = PandasPGFrame.from_frames(
    nodes=entity_nodes,
    edges=section_cooccurrence_edges,
    node_prop_types={
        "paper_frequency": "numeric",
        "section_frequency": "numeric",
        "paragraph_frequency": "numeric"
    },
    edge_prop_types={
        "frequency": "numeric",
        "ppmi": "numeric",
    })

### Paragraph-based co-occurrence

In [25]:
def aggregate_paragraphs(data):
    return set(sum(data["paragraph"].apply(list), []))

In [26]:
paragraph_cooccurrence_edges = generate_cooccurrence_edges(
    graph, "OccursIn", 
    factor_aggregator=aggregate_paragraphs,
    compute_statistics=["frequency", "ppmi", "npmi"],
    parallelize=True, cores=8)

Examining 12246 pairs of terms for co-occurrence...


In [27]:
paragraph_cooccurrence_edges["@type"] = "CoOccursWith"
paragraph_cooccurrence_edges

@source_id,@target_id,common_factors,frequency,ppmi,npmi,@type
acetaminophen,blood,{179426:Blood Glucose Monitoring ::: Special A...,1,1.321928,0.162613,CoOccursWith
acetaminophen,covid-19,{197804:Management Of Children And Young Peopl...,2,0.000000,0.000000,CoOccursWith
acetaminophen,glucose,{179426:Blood Glucose Monitoring ::: Special A...,1,0.669851,0.082400,CoOccursWith
acetaminophen,ibuprofen,{197804:Management Of Children And Young Peopl...,2,5.129283,0.719467,CoOccursWith
acetaminophen,insulin,{179426:Blood Glucose Monitoring ::: Special A...,1,1.959358,0.241025,CoOccursWith
...,...,...,...,...,...,...
tumor necrosis factor,viral,{214924:The Protective Role Of Angiotensin-Con...,2,2.415037,0.338749,CoOccursWith
tumor necrosis factor,virus,{214924:The Protective Role Of Angiotensin-Con...,1,1.347923,0.165811,CoOccursWith
vildagliptin,viral,{179426:Role Of Dpp4 Enzyme And Dpp4 Inhibitor...,1,1.152003,0.141710,CoOccursWith
vildagliptin,viral entry,{179426:Role Of Dpp4 Enzyme And Dpp4 Inhibitor...,1,3.222392,0.396393,CoOccursWith


In [28]:
paragraph_network = PandasPGFrame.from_frames(
    nodes=entity_nodes,
    edges=paragraph_cooccurrence_edges,
    node_prop_types={
        "paper_frequency": "numeric",
        "section_frequency": "numeric",
        "paragraph_frequency": "numeric"
    },
    edge_prop_types={
        "frequency": "numeric",
        "ppmi": "numeric",
    })

**NB**

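_Positive pointwise mutual information (PPMI)_ is defined as $PPMI(x, y) = \max\left(0, \log_2{\frac{p(x, y)}{p(x)p(y)}}\right)$ if $p(x, y) \neq 0$, and $PPMI(x, y) = 0$ otherwise. Negative values are clipped to zero, which is why some co-occurring pairs in the tables above have a PPMI of exactly 0.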

_Normalized pointwise mutual information (NPMI)_ is defined as $NPMI(x, y) = \frac{PPMI(x, y)}{-\log_2{p(x, y)}} $.

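The mutual-information-based distances are calculated as the reciprocal of the corresponding score, i.e. $\frac{1}{NPMI}$ or $\frac{1}{PPMI}$, and are infinite when the score is zero.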
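As a worked example, the sketch below recomputes the scores of the paragraph-level pair 'acetaminophen' and 'blood' from raw counts. The co-occurrence frequency (1) and the paragraph frequencies (4 and 28) are taken from the tables in this notebook. The total number of paragraphs is not stated anywhere in the text, so the value 280 is an assumption, chosen because it reproduces the scores reported for this pair. The function names are illustrative and are not the library's API.

```python
import math


def ppmi(co_freq, x_freq, y_freq, total):
    """Positive PMI (base 2) from raw counts; negative values are clipped to 0."""
    if co_freq == 0 or x_freq == 0 or y_freq == 0:
        return 0.0
    pmi = math.log2((co_freq * total) / (x_freq * y_freq))
    return max(pmi, 0.0)


def npmi(co_freq, x_freq, y_freq, total):
    """PPMI divided by -log2 p(x, y), following the definition above."""
    if co_freq == 0 or co_freq == total:
        # No co-occurrence, or p(x, y) = 1 and the normalizer is zero;
        # returning 0 in the second case is a choice made for this sketch.
        return 0.0
    return ppmi(co_freq, x_freq, y_freq, total) / -math.log2(co_freq / total)


def distance(score):
    """Reciprocal of a score, infinite for a zero score."""
    return 1 / score if score > 0 else math.inf


TOTAL_PARAGRAPHS = 280  # assumption, not stated in the text

score_ppmi = ppmi(1, 4, 28, TOTAL_PARAGRAPHS)
score_npmi = npmi(1, 4, 28, TOTAL_PARAGRAPHS)
print(round(score_ppmi, 6), round(score_npmi, 6), round(distance(score_npmi), 6))

# A pair involving a very frequent term: the PMI is negative and is clipped to 0
clipped = ppmi(2, 4, 192, TOTAL_PARAGRAPHS)
```

Under the same assumption, the last call corresponds to 'acetaminophen' and 'covid-19' (2 shared paragraphs, 192 paragraphs mentioning 'covid-19'), whose reported PPMI is 0.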

### Create metric processor objects

In [29]:
import math

def compute_distance(x):
    return 1 / x if x > 0 else math.inf

In [31]:
npmi_distance = paper_network.edges(raw_frame=True)["npmi"].apply(compute_distance)
npmi_distance.name = "distance_npmi"
paper_network.add_edge_properties(npmi_distance, "numeric")

In [32]:
paper_network.edges(raw_frame=True)

@source_id,@target_id,common_factors,frequency,ppmi,npmi,@type,distance_npmi
ace inhibitor,acetaminophen,{197804},1,2.321928,0.537244,CoOccursWith,1.861353
ace inhibitor,acute lung injury,"{184360, 197804}",2,2.321928,0.698970,CoOccursWith,1.430677
ace inhibitor,acute respiratory distress syndrome,"{184360, 197804}",2,1.736966,0.522879,CoOccursWith,1.912489
ace inhibitor,adenosine,{184360},1,2.321928,0.537244,CoOccursWith,1.861353
ace inhibitor,adipose tissue,"{184360, 197804}",2,2.736966,0.823909,CoOccursWith,1.213727
...,...,...,...,...,...,...,...
viral,viral infection,"{214924, 184360, 211125, 211373}",4,2.000000,0.861353,CoOccursWith,1.160964
viral,virus,"{184360, 179426, 211373, 214924, 211125}",5,1.514573,0.757287,CoOccursWith,1.320504
viral entry,viral infection,{214924},1,1.321928,0.305865,CoOccursWith,3.269412
viral entry,virus,"{214924, 179426}",2,1.514573,0.455932,CoOccursWith,2.193310


In [35]:
npmi_distance = section_network.edges(raw_frame=True)["npmi"].apply(compute_distance)
npmi_distance.name = "distance_npmi"
section_network.add_edge_properties(npmi_distance, "numeric")

In [36]:
npmi_distance = paragraph_network.edges(raw_frame=True)["npmi"].apply(compute_distance)
npmi_distance.name = "distance_npmi"
paragraph_network.add_edge_properties(npmi_distance, "numeric")

Density of a network is quantified by the proportion of all possible edges ($n(n-1) / 2$ for the undirected network with $n$ nodes) that are realized.
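A minimal sketch of this definition in plain Python (the function name and the toy numbers are illustrative and do not come from the library):

```python
def density(n_nodes, n_edges):
    """Fraction of possible undirected edges (no self-loops) that are present."""
    if n_nodes < 2:
        return 0.0
    possible = n_nodes * (n_nodes - 1) / 2
    return n_edges / possible


# Toy graph: 4 nodes and 3 of the 6 possible edges
toy_density = density(4, 3)
print(toy_density)
```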

In [37]:
paper_metrics = GTMetricProcessor(paper_network)
section_metrics = GTMetricProcessor(section_network)
paragraph_metrics = GTMetricProcessor(paragraph_network)

In [94]:
paper_metrics.graph.ep

{'common_factors': <EdgePropertyMap object with value type 'string', for Graph 0x7fd48ea853c8, at 0x7fd48ed3de80>, 'frequency': <EdgePropertyMap object with value type 'double', for Graph 0x7fd48ea853c8, at 0x7fd48e096438>, 'ppmi': <EdgePropertyMap object with value type 'double', for Graph 0x7fd48ea853c8, at 0x7fd48e096a58>, 'npmi': <EdgePropertyMap object with value type 'string', for Graph 0x7fd48ea853c8, at 0x7fd48e0964a8>, '@type': <EdgePropertyMap object with value type 'string', for Graph 0x7fd48ea853c8, at 0x7fd48e096b38>, 'distance_npmi': <EdgePropertyMap object with value type 'string', for Graph 0x7fd48ea853c8, at 0x7fd48e096d68>}

In [35]:
print("Density of the paper-based network: ", paper_metrics.density())
print("Density of the section-based network: ", section_metrics.density())
print("Density of the paragraph-based network: ", paragraph_metrics.density())

Density of the paper-based network:  0.7960150253143884
Density of the section-based network:  0.42520006532745386
Density of the paragraph-based network:  0.22497141923893516


The results above show that in the paper, section and paragraph networks, respectively, 80%, 42% and 22% of all possible term pairs co-occur at least once.

## Nearest neighbors by (N/P)PMI

To illustrate the importance of computing mutual-information-based scores rather than raw frequencies, consider the following example, where we would like to find the closest (most related) neighbors of a specific term.

We will use the paragraph-based network and the raw co-occurrence frequency as the weight of our co-occurrence relation.

In [36]:
paragraph_path_finder = NXPathFinder(paragraph_network)

In [37]:
paragraph_path_finder.top_neighbors("glucose", 10, weight="frequency")

{'covid-19': 29,
 'diabetes mellitus': 29,
 'blood': 18,
 'insulin': 11,
 'death': 9,
 'coronavirus': 8,
 'hyperglycemia': 8,
 'infectious disorder': 6,
 'inflammation': 6,
 'sars coronavirus': 6}

In [38]:
paragraph_path_finder.top_neighbors("lung", 10, weight="frequency")

{'covid-19': 24,
 'angiotensin-converting enzyme 2': 16,
 'sars-cov-2': 13,
 'acute lung injury': 12,
 'diabetes mellitus': 10,
 'pulmonary': 10,
 'sars coronavirus': 10,
 'human': 9,
 'viral': 9,
 'mouse': 7}

We observe that 'glucose' and 'lung' share many of their closest neighbors by raw frequency. Looking at the top 10 entities by paragraph frequency in the entire corpus (listed below), we notice that 'glucose' and 'lung' co-occur the most with the terms that are simply the most frequent in our corpus, such as 'covid-19' and 'diabetes mellitus'.

(Closer inspection of the distribution of weighted node degrees suggests that the network contains _hubs_, nodes with significantly high-degree connectivity to other nodes.)

In [39]:
paragraph_network.nodes(raw_frame=True).nlargest(10, columns=["paragraph_frequency"])

@id,@type,paper_frequency,section_frequency,paragraph_frequency
covid-19,Entity,20,94,192
diabetes mellitus,Entity,19,82,156
sars-cov-2,Entity,7,36,63
glucose,Entity,13,31,44
angiotensin-converting enzyme 2,Entity,7,26,41
dipeptidyl peptidase 4,Entity,4,17,41
coronavirus,Entity,16,27,31
blood,Entity,8,21,28
lung,Entity,6,18,28
sars coronavirus,Entity,8,18,26


To account for the presence of such hubs, we use the mutual-information-based scores presented above. They 'balance' the influence of the highly connected hub nodes such as 'covid-19' and 'diabetes mellitus' in our example.
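The effect can be reproduced on a toy example. In the sketch below all counts are hypothetical: a query term occurs in 10 of 100 paragraphs, a hub term occurs in 80 paragraphs and shares 8 of them with the query term, and a specific term occurs in only 5 paragraphs but shares 4 of them. The helper `npmi` follows the definitions given earlier and assumes a co-occurrence count strictly between 0 and the total.

```python
import math


def npmi(co_freq, x_freq, y_freq, total):
    """Clipped PMI divided by -log2 p(x, y); requires 0 < co_freq < total."""
    pmi = math.log2((co_freq * total) / (x_freq * y_freq))
    return max(pmi, 0.0) / -math.log2(co_freq / total)


TOTAL = 100     # hypothetical number of paragraphs
TERM_FREQ = 10  # hypothetical number of paragraphs mentioning the query term

# neighbor -> (paragraphs mentioning the neighbor, paragraphs shared with the query term)
neighbors = {
    "hub term": (80, 8),
    "specific term": (5, 4),
}

by_frequency = sorted(neighbors, key=lambda n: neighbors[n][1], reverse=True)
by_npmi = sorted(
    neighbors,
    key=lambda n: npmi(neighbors[n][1], TERM_FREQ, neighbors[n][0], TOTAL),
    reverse=True)

print("by frequency:", by_frequency)
print("by NPMI:", by_npmi)
```

The hub term wins by raw frequency, but it co-occurs with the query term exactly as often as expected by chance, so its NPMI is zero and the specific term ranks first.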

In [40]:
paragraph_path_finder.top_neighbors("glucose", 10, weight="npmi")

{'blood': 0.5133209650995287,
 'glucose metabolism disorder': 0.43558951200762297,
 'insulin': 0.41957609533629175,
 'thrombophilia': 0.4079646453270325,
 'bals r': 0.3744908698338857,
 'insulin infusion': 0.3744908698338857,
 'leukopenia': 0.3744908698338857,
 'millimole per liter': 0.3744908698338857,
 'troponin t, cardiac muscle': 0.3744908698338857,
 'hyperglycemia': 0.3255525953220345}

In [41]:
paragraph_path_finder.top_neighbors("lung", 10, weight="npmi")

{'acute lung injury': 0.731006557092012,
 'pulmonary': 0.6362945400636919,
 'angiotensin-converting enzyme 2': 0.4757184436640079,
 'receptor binding': 0.46595542454855043,
 'animal': 0.4465392749200213,
 'viral infection': 0.4465392749200213,
 'mouse': 0.4362945258726772,
 'angiotensin-1': 0.4259996541516483,
 'viral': 0.4233482775367079,
 'human': 0.4098154380746763}

## Node centralities

In this example we will compute the Degree and PageRank centralities only for the raw frequency, and the Betweenness centrality for the mutual-information-based scores.

_Degree centrality_ is given by the sum of weights of all incident edges of the given node and characterizes the importance of the node in the network in terms of its connectivity to other nodes (high degree = high connectivity).
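In plain Python this definition amounts to the following sketch (the function name and the toy edge weights are illustrative, not the library's API or data):

```python
from collections import defaultdict


def weighted_degree(edges):
    """Sum the weights of all edges incident to each node (undirected graph)."""
    degree = defaultdict(float)
    for (source, target), weight in edges.items():
        degree[source] += weight
        degree[target] += weight
    return dict(degree)


# Toy co-occurrence frequencies
toy_edges = {("a", "b"): 3, ("a", "c"): 2, ("b", "c"): 1}
degrees = weighted_degree(toy_edges)
print(degrees)
```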

In [42]:
paragraph_metrics.degree_centrality("frequency", write=True, write_property="degree")

_PageRank centrality_ is another measure that estimates the importance of the given node in the network. Roughly speaking, it can be interpreted as the probability that, having landed on a random node in the network, we will jump to the given node (here the edge weights are taken into account).

https://en.wikipedia.org/wiki/PageRank
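To make the random-walk interpretation concrete, here is a minimal power-iteration sketch of weighted PageRank on an undirected graph. It is not the library's implementation: the function name, the damping factor of 0.85 (a commonly used value) and the fixed number of iterations are assumptions made for this illustration, and every node is assumed to have at least one edge with a positive weight.

```python
def pagerank(edges, damping=0.85, iterations=100):
    """Weighted PageRank on an undirected graph by power iteration."""
    # Build a symmetric weighted adjacency map
    adjacency = {}
    for (source, target), weight in edges.items():
        adjacency.setdefault(source, {})[target] = weight
        adjacency.setdefault(target, {})[source] = weight

    nodes = list(adjacency)
    n = len(nodes)
    strength = {node: sum(adjacency[node].values()) for node in nodes}
    rank = {node: 1 / n for node in nodes}

    for _ in range(iterations):
        # Every node receives the same share of the "random jump" probability
        new_rank = {node: (1 - damping) / n for node in nodes}
        for node in nodes:
            for neighbor, weight in adjacency[node].items():
                # A node passes its rank to neighbors in proportion to edge weight
                new_rank[neighbor] += damping * rank[node] * weight / strength[node]
        rank = new_rank
    return rank


# Toy star graph: one hub connected to three leaves
ranks = pagerank({("hub", "a"): 1, ("hub", "b"): 1, ("hub", "c"): 1})
print(max(ranks, key=ranks.get))
```

The scores always sum to one, and the hub of the star receives the highest score.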

In [43]:
paragraph_metrics.pagerank_centrality("frequency", write=True, write_property="pagerank")

We then compute the betweenness centrality based on the NPMI distances.

_Betweenness centrality_ is a node importance measure that estimates how often a shortest path between a pair of nodes will pass through the given node.
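The definition can be illustrated with a brute-force sketch that enumerates all simple paths, which is only feasible for very small graphs. The function names and the toy distances are assumptions made for this example, and the score is left unnormalized, whereas the values reported by the library further below appear to be normalized.

```python
from itertools import combinations
import math


def simple_paths(adjacency, source, target, path=None):
    """Yield every simple path from source to target (small graphs only)."""
    path = [source] if path is None else path
    if source == target:
        yield path
        return
    for neighbor in adjacency[source]:
        if neighbor not in path:
            yield from simple_paths(adjacency, neighbor, target, path + [neighbor])


def path_length(adjacency, path):
    """Total distance along a path."""
    return sum(adjacency[a][b] for a, b in zip(path, path[1:]))


def betweenness(adjacency):
    """For each node, sum over node pairs the fraction of shortest paths
    between the pair that pass through the node (unnormalized)."""
    scores = {node: 0.0 for node in adjacency}
    for source, target in combinations(adjacency, 2):
        paths = list(simple_paths(adjacency, source, target))
        if not paths:
            continue
        best = min(path_length(adjacency, p) for p in paths)
        shortest = [p for p in paths
                    if math.isclose(path_length(adjacency, p), best)]
        for path in shortest:
            for node in path[1:-1]:  # interior nodes only
                scores[node] += 1 / len(shortest)
    return scores


# Toy distances: the detour a-b-c (1 + 1) is shorter than the direct edge a-c (5)
toy_adjacency = {
    "a": {"b": 1, "c": 5},
    "b": {"a": 1, "c": 1},
    "c": {"a": 5, "b": 1},
}
scores = betweenness(toy_adjacency)
print(scores)
```

Because the shortest route between `a` and `c` runs through `b`, only `b` receives a non-zero score.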

In [44]:
paragraph_metrics.betweenness_centrality("distance_npmi", write=True, write_property="betweenness")

In [45]:
new_paragraph_graph = paragraph_metrics.get_pgframe()

In [46]:
new_paragraph_graph.nodes(raw_frame=True)

@id,@type,paper_frequency,section_frequency,paragraph_frequency,degree,pagerank,betweenness
ace inhibitor,Entity,2.0,6.0,9.0,47.0,0.004152,0.000207
acetaminophen,Entity,2.0,3.0,4.0,7.0,0.001491,0.000000
acute lung injury,Entity,4.0,11.0,12.0,133.0,0.009832,0.015261
acute respiratory distress syndrome,Entity,6.0,11.0,17.0,153.0,0.011580,0.014640
adenosine,Entity,2.0,2.0,2.0,24.0,0.002613,0.009377
...,...,...,...,...,...,...,...
vildagliptin,Entity,2.0,2.0,6.0,59.0,0.005170,0.003060
viral,Entity,5.0,15.0,21.0,227.0,0.016691,0.012242
viral entry,Entity,2.0,5.0,5.0,58.0,0.004936,0.015509
viral infection,Entity,4.0,6.0,6.0,60.0,0.004988,0.006286


In [47]:
print("Top 10 nodes by degree")
for n in new_paragraph_graph.nodes(raw_frame=True).nlargest(10, columns=["degree"]).index:
    print("\t", n)

Top 10 nodes by degree
	 covid-19
	 diabetes mellitus
	 sars-cov-2
	 angiotensin-converting enzyme 2
	 dipeptidyl peptidase 4
	 lung
	 glucose
	 coronavirus
	 sars coronavirus
	 viral


In [48]:
print("Top 10 nodes by PageRank")
for n in new_paragraph_graph.nodes(raw_frame=True).nlargest(10, columns=["pagerank"]).index:
    print("\t", n)

Top 10 nodes by PageRank
	 covid-19
	 diabetes mellitus
	 sars-cov-2
	 angiotensin-converting enzyme 2
	 dipeptidyl peptidase 4
	 glucose
	 lung
	 coronavirus
	 viral
	 sars coronavirus


In [49]:
print("Top 10 nodes by betweenness")
for n in new_paragraph_graph.nodes(raw_frame=True).nlargest(10, columns=["betweenness"]).index:
    print("\t", n)

Top 10 nodes by betweenness
	 lymphopenia
	 pulmonary
	 glucose metabolism disorder
	 t-lymphocyte
	 cough
	 chemokine
	 d-dimer measurement
	 kidney
	 interleukin-19
	 ibuprofen


## Compute multiple metrics in one go

Alternatively, we can compute all the metrics in one go. To do so, we need to specify edge attributes used for computing different metrics (if an empty list is specified as a weight list for a metric, computation of this metric is not performed). 

We select the paragraph-based network and re-compute some of the previously illustrated metrics as follows:

In [58]:
result_metrics = paragraph_metrics.compute_all_node_metrics(
    degree_weights=["frequency"],
    pagerank_weights=["frequency"],
    betweenness_weights=["distance_npmi"])

In [59]:
result_metrics

{'degree': {'frequency': {'ace inhibitor': 47.0,
   'acetaminophen': 7.0,
   'acute lung injury': 133.0,
   'acute respiratory distress syndrome': 153.0,
   'adenosine': 24.0,
   'adipose tissue': 28.0,
   'angioedema': 23.0,
   'angiotensin ii receptor antagonist': 87.0,
   'angiotensin-1': 77.0,
   'angiotensin-2': 19.0,
   'angiotensin-converting enzyme': 13.0,
   'angiotensin-converting enzyme 2': 366.0,
   'animal': 53.0,
   'apoptosis': 20.0,
   'bals r': 6.0,
   'basal': 36.0,
   'blood': 174.0,
   'blood vessel': 30.0,
   'bradykinin': 18.0,
   'c-c motif chemokine 1': 48.0,
   'c-reactive protein': 35.0,
   'cardiac failure': 18.0,
   'cardiovascular disorder': 109.0,
   'cardiovascular system': 119.0,
   'cd44 antigen': 46.0,
   'cellular secretion': 49.0,
   'cerebrovascular': 19.0,
   'chemokine': 45.0,
   'chest pain': 23.0,
   'chloroquine': 8.0,
   'chronic disease': 12.0,
   'chronic kidney disease': 19.0,
   'comorbidity': 19.0,
   'confounding factors': 26.0,
   'coro

## Community detection

_Community detection_ methods partition the network into clusters of densely connected nodes, so that nodes in the same community are more strongly connected to each other than to nodes in other communities.

The _modularity_ measure quantifies how well the communities are 'separated' (https://en.wikipedia.org/wiki/Modularity_(networks)).
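For reference, the modularity of a given partition can be computed with a few lines of plain Python. The sketch below uses the standard formulation, in which each community contributes its share of the total edge weight minus the share expected from its total degree. The function name and the toy graph are illustrative and do not come from the library.

```python
def modularity(edges, communities):
    """Modularity of a partition of a weighted undirected graph.

    `edges` maps (source, target) pairs to weights (no self-loops),
    `communities` maps each node to a community label.
    """
    total = sum(edges.values())
    internal = {}  # total weight of edges inside each community
    degree = {}    # total weighted degree of each community
    for (source, target), weight in edges.items():
        c_source, c_target = communities[source], communities[target]
        degree[c_source] = degree.get(c_source, 0) + weight
        degree[c_target] = degree.get(c_target, 0) + weight
        if c_source == c_target:
            internal[c_source] = internal.get(c_source, 0) + weight
    return sum(
        internal.get(c, 0) / total - (degree[c] / (2 * total)) ** 2
        for c in degree)


# Toy graph: two triangles joined by a single bridge edge
toy_edges = {
    (1, 2): 1, (1, 3): 1, (2, 3): 1,
    (4, 5): 1, (4, 6): 1, (5, 6): 1,
    (3, 4): 1,
}
two_groups = {1: "A", 2: "A", 3: "A", 4: "B", 5: "B", 6: "B"}
one_group = {node: "A" for node in two_groups}

q_split = modularity(toy_edges, two_groups)
q_single = modularity(toy_edges, one_group)
print(round(q_split, 4), round(q_single, 4))
```

Splitting the graph into its two triangles yields a clearly positive modularity, while putting all nodes into a single community yields zero.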

In [None]:
print("Paper-based graph")
print("-----------------")
_ = detect_communities(paper_comention_network, weight="frequency", set_attr="community")
_ = detect_communities(paper_comention_network, weight="npmi", set_attr="community_npmi")
print()

print("Section-based graph")
print("-------------------")
_ = detect_communities(section_comention_network, weight="frequency", set_attr="community")
_ = detect_communities(section_comention_network, weight="npmi", set_attr="community_npmi")
print()

print("Paragraph-based graph")
print("---------------------")
_ = detect_communities(paragraph_comention_network, weight="frequency", set_attr="community")
_ = detect_communities(paragraph_comention_network, weight="npmi", set_attr="community_npmi")
print()

In the example above we can see that the best partition is obtained on the paragraph-based network using the NPMI edge weight (resulting in a modularity of 0.34).

## Export network and the computed metrics

In [None]:
# Save graph nodes as a pickled pandas.DataFrame
save_network(paragraph_comention_network, "data/literature_comention")

In [None]:
# Save the graph for Gephi import.
save_to_gephi(
    paragraph_comention_network, "data/gephi_literature_comention", 
    node_attr_mapping = {
        "degree_frequency": "Degree",
        "pagerank_frequency": "PageRank",
        "betweenness_distance_npmi": "Betweenness",
        "community_npmi": "Community"
    },
    edge_attr_mapping={
        "npmi": "Weight"
    })

The representation of the network saved above can be imported into Gephi for producing graph visualizations, as in the following example:

In the figures below colors represent communities detected using the NPMI weight, node sizes are proportional to the PageRank of nodes and edge thickness to the NPMI values.

**Full network**
<img src="./figures/literature/full_network.png" alt="Literature co-occurrence network" style="width: 400px;"/>

**Community "Symptoms and comorbidities"**
<img src="./figures/literature/covid_19_comorbidities.png" alt="Drawing" style="width: 400px;"/>

**Community "Viral biology"**
<img src="./figures/literature/virus.png" alt="Drawing" style="width: 600px;"/>

**Community "Immunity"**
<img src="./figures/literature/immunity.png" alt="Drawing" style="width: 600px;"/>

## Minimum spanning trees

A _spanning tree_ of a connected network is a subset of $n - 1$ edges that connects all of its $n$ nodes. A _minimum spanning tree_ is a spanning tree with the smallest total edge weight.

In the following example we compute a minimum spanning tree minimizing the NPMI-based distance weight of the network edges.
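For intuition, a minimum spanning tree can be built with Kruskal's algorithm: edges are visited in order of increasing distance, and an edge is kept only if it joins two components that are not yet connected. The sketch below is a generic illustration with a hypothetical function name and toy distances; the text does not say which algorithm the library uses.

```python
def kruskal_mst(edges):
    """Kruskal's algorithm; `edges` maps (source, target) pairs to distances."""
    parent = {}

    def find(node):
        """Return the representative of the component containing `node`."""
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    tree = []
    for (source, target), weight in sorted(edges.items(), key=lambda item: item[1]):
        root_s, root_t = find(source), find(target)
        if root_s != root_t:  # the edge joins two separate components
            parent[root_s] = root_t
            tree.append((source, target, weight))
    return tree


# Toy distances
toy_edges = {("a", "b"): 1.0, ("b", "c"): 2.0, ("a", "c"): 3.0, ("c", "d"): 1.5}
toy_tree = kruskal_mst(toy_edges)
print(toy_tree)
```

The tree keeps three of the four edges and drops `a`-`c`, the most distant one, because `a` and `c` are already connected through `b`.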

In [None]:
tree = minimum_spanning_tree(paragraph_comention_network, weight="distance_npmi")

In [None]:
save_to_gephi(
    tree, "data/gephi_literature_spanning_tree", 
    node_attr_mapping = {
        "degree_frequency": "Degree",
        "pagerank_frequency": "PageRank",
        "betweenness_distance_npmi": "Betweenness",
        "community_npmi": "Community"
    },
    edge_attr_mapping={
        "npmi": "Weight"
    })

The representation of the network saved above can be imported into Gephi for producing graph visualizations, as in the following example:

In the figures below colors represent communities detected using the NPMI weight, node sizes are proportional to the PageRank of nodes and edge thickness to the NPMI values.

**Full network**
<img src="./figures/literature/tree.png" alt="Literature co-occurrence MST" style="width: 400px;"/>

**Zoom into "covid-19"**
<img src="./figures/literature/tree_covid-19.png" alt="Drawing" style="width: 700px;"/>

## Shortest path search

The _shortest path search problem_ consists in finding a sequence of edges from the source node to the target node that minimizes the cumulative weight (or distance) associated with the edges.
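A standard way of solving this problem for non-negative distances is Dijkstra's algorithm. The sketch below is a generic illustration on a toy graph; the function name and the distances are assumptions, and it is not the implementation behind the library calls used in this notebook.

```python
import heapq


def dijkstra_path(adjacency, source, target):
    """Dijkstra's algorithm; returns (distance, path), or (inf, []) if unreachable."""
    queue = [(0.0, source, [source])]
    visited = set()
    while queue:
        dist, node, path = heapq.heappop(queue)
        if node == target:
            return dist, path
        if node in visited:
            continue
        visited.add(node)
        for neighbor, weight in adjacency[node].items():
            if neighbor not in visited:
                heapq.heappush(queue, (dist + weight, neighbor, path + [neighbor]))
    return float("inf"), []


# Toy distances: the detour a-b-c (1 + 1) is shorter than the direct edge a-c (5)
toy_adjacency = {
    "a": {"b": 1.0, "c": 5.0},
    "b": {"a": 1.0, "c": 1.0},
    "c": {"a": 5.0, "b": 1.0},
}
best = dijkstra_path(toy_adjacency, "a", "c")
print(best)
```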

In [None]:
path = single_shortest_path(paragraph_comention_network, "lung", "sars-cov-2")
pretty_print_paths([path])

The cell above illustrates that the single shortest path from 'lung' to 'sars-cov-2' consists of the direct edge between them.

We adapt this problem to the literature exploration task, i.e. having fixed the source and the target concepts (the relation here is actually symmetric as the edges of our network are undirected), we would like to find a _set_ of $n$ shortest paths between them. Moreover, we would like these paths to be _indirect_ (not to include the direct edge from the source to the target). In the following examples we use mutual-information-based edge weights to perform our literature exploration. 

The library includes two strategies for finding such $n$ shortest paths. The first strategy uses Yen's algorithm for finding $n$ loopless shortest paths from the source to the target (https://en.wikipedia.org/wiki/Yen%27s_algorithm).


In [None]:
paths = top_n_paths(
    paragraph_comention_network, "lung", "sars-cov-2", n=10,
    distance="distance_npmi", strategy="yen")

In [None]:
pretty_print_paths(paths)

The second, _naive_, strategy is suitable for scenarios in which the networks are large and highly dense (the performance of Yen's algorithm then degrades, as the number of edges approaches $O(N^2)$, with $N$ being the number of nodes).

This strategy simply finds _all_ the indirect shortest paths from the source to the target (in dense graphs the most common such paths are of length 2, i.e. `source <-> intermediary <-> target`, and therefore the number of such paths is roughly proportional to the number of nodes in the network). Then the cumulative distance score is computed for every path, and the top $n$ paths with the best score are selected.
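A simplified sketch of this idea, restricted to paths of length 2, is shown below. The function name, the node names and the distances are hypothetical, and the library's naive strategy may differ in its details.

```python
import heapq


def top_two_hop_paths(adjacency, source, target, n):
    """Rank paths `source <-> intermediary <-> target` by total distance."""
    candidates = []
    for node in adjacency[source]:
        if node != target and target in adjacency[node]:
            total = adjacency[source][node] + adjacency[node][target]
            candidates.append((total, [source, node, target]))
    return heapq.nsmallest(n, candidates)


# Toy symmetric distances; the direct edge s-t is ignored by construction
toy_adjacency = {
    "s": {"t": 1.2, "m1": 1.0, "m2": 2.0},
    "t": {"s": 1.2, "m1": 1.5, "m2": 2.0},
    "m1": {"s": 1.0, "t": 1.5},
    "m2": {"s": 2.0, "t": 2.0},
}
ranked = top_two_hop_paths(toy_adjacency, "s", "t", n=2)
print(ranked)
```

Each intermediary is examined once, so the cost grows linearly with the number of neighbors of the source.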

In [None]:
paths = top_n_paths(
    paragraph_comention_network, "lung", "sars-cov-2", n=10,
    distance="distance_npmi",
    strategy="naive")

In [None]:
pretty_print_paths(paths)

We can also look for shortest paths in the minimum spanning tree constructed using the mutual-information-based scores (there is exactly one such path).

In [None]:
path = single_shortest_path(tree, "lung", "sars-cov-2")
pretty_print_paths([path])

The library provides an additional utility for finding _tripaths_, paths of the shape `source <-> intermediary <-> target`. Setting the parameter `intersecting` to `False` ensures that the sets of entities discovered on the paths `source <-> intermediary` and `intermediary <-> target` do not overlap.

In [None]:
path_a_b, path_b_c =  top_n_tripaths(
    paper_comention_network, "lung", "glucose", "sars-cov-2", 10,
    strategy="yen", distance="distance_npmi", intersecting=False)

In [None]:
pretty_print_tripaths("lung", "glucose", "sars-cov-2", 10, path_a_b, path_b_c)

## Nested path search

To explore the space of co-occurring terms in depth, we can run the path search procedure presented above in a _nested fashion_. For each path through the entities $e_1, e_2, ..., e_n$ found between the source and the target, we can further expand every edge into a set of shortest paths between the corresponding pair of successive entities (i.e. paths between $e_1$ and $e_2$, $e_2$ and $e_3$, etc.).

In [None]:
paths = top_n_nested_paths(
    paragraph_comention_network, "lung", "glucose",
    n=10, nested_n=2, depth=2, distance="distance_npmi",
    strategy="naive")

In [None]:
print(paths)

We can now visualize the subnetwork constructed using the nodes and the edges discovered during our nested path search.

In [None]:
summary_graph = graph_from_paths(paths, source_graph=paper_comention_network)

In [None]:
print("Number of nodes: ", summary_graph.number_of_nodes())
print("Number of edges: ", summary_graph.number_of_edges())

In [None]:
# Save the graph for Gephi import.
save_to_gephi(
    summary_graph, "data/gephi_literature_path_graph", 
    node_attr_mapping = {
        "degree_frequency": "Degree",
        "pagerank_frequency": "PageRank",
        "betweenness_distance_npmi": "Betweenness",
        "community_npmi": "Community"
    },
    edge_attr_mapping={
        "npmi": "Weight"
    })

The resulting graph visualized with Gephi
<img src="./figures/literature/path_graph.png" alt="Literature co-occurrence network" style="width: 800px;"/>