## Prerequisites and installation instructions


To run this notebook, a Neo4j database must be installed and running (see the [Neo4j installation instructions](https://neo4j.com/docs/operations-manual/current/installation/)). Typically, the Neo4j-based interfaces provided by BlueGraph require the database URI (the bolt port), a username and a password. In addition, BlueGraph uses the Neo4j Graph Data Science (GDS) library, which must be installed separately for the database on which you would like to run the analytics (see the [installation instructions](https://neo4j.com/docs/graph-data-science/current/installation/)). The currently supported Neo4j GDS version is `>=1.6.1`.

BlueGraph and the set of dependencies supporting `neo4j` can be installed using:

 ```
 pip install bluegraph[neo4j]
 ```

# NASA dataset keywords analysis

In this notebook we use graph-based co-occurrence analysis on the publicly available NASA data catalog (https://data.nasa.gov/browse, with the API endpoint https://data.nasa.gov/data.json). This dataset consists of the metadata for different NASA datasets.

We will work on the sets of keywords attached to each dataset and build a keyword co-occurrence graph describing relations between different dataset keywords. The keyword relations in the above-mentioned graph are quantified using mutual-information-based scores (normalized pointwise mutual information).
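
To make the NPMI score concrete, here is a minimal sketch based on the standard definition, $\mathrm{NPMI}(x, y) = \log\frac{p(x, y)}{p(x)p(y)} / (-\log p(x, y))$, where the probabilities are estimated from dataset counts. The function name `npmi` and its arguments are illustrative assumptions, not part of BlueGraph, whose internal implementation may differ in details such as the handling of edge cases.

```python
import math


def npmi(n_xy, n_x, n_y, n_total):
    """Normalized pointwise mutual information from raw counts.

    n_xy: number of datasets containing both keywords
    n_x, n_y: number of datasets containing each keyword
    n_total: total number of datasets
    (hypothetical helper, not a BlueGraph function)
    """
    if n_xy == 0:
        return -1.0  # keywords never co-occur (limit value)
    if n_xy == n_total:
        return 1.0  # keywords occur together in every dataset (convention)
    p_xy = n_xy / n_total
    p_x = n_x / n_total
    p_y = n_y / n_total
    pmi = math.log(p_xy / (p_x * p_y))
    return pmi / -math.log(p_xy)


# Two keywords that always appear together score close to 1
always_together = npmi(n_xy=10, n_x=10, n_y=10, n_total=100)
# Two keywords that co-occur exactly as often as chance predicts score 0
independent = npmi(n_xy=25, n_x=50, n_y=50, n_total=100)
print(always_together, independent)
```

Scores range from -1 (never together) through 0 (independent) to 1 (always together), which is why only edges with positive NPMI are of interest for the distance-based analyses later in the notebook.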

See the related tutorial here: https://www.tidytextmining.com/nasa.html

In this tutorial we will use the Neo4j-based implementation of different analytics interfaces provided by BlueGraph. Therefore, in order to use it, you need a running instance of the Neo4j database (see installation [instructions](https://neo4j.com/docs/operations-manual/current/installation/)).

In [None]:
import os
import json
import pandas as pd
import requests
import getpass

In [None]:
from bluegraph.core import (PandasPGFrame,
                            pretty_print_paths,
                            pretty_print_tripaths)
from bluegraph.preprocess.generators import CooccurrenceGenerator
from bluegraph.backends.neo4j import (pgframe_to_neo4j,
                                      Neo4jMetricProcessor,
                                      Neo4jPathFinder,
                                      Neo4jCommunityDetector,
                                      Neo4jGraphProcessor)
from bluegraph.backends.networkx import NXPathFinder, networkx_to_pgframe

## I. Data preparation

Download and read the NASA dataset.

In [None]:
NASA_META_DATA_URL = "https://data.nasa.gov/data.json"
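if not os.path.isfile("../data/nasa.json"):
    r = requests.get(NASA_META_DATA_URL)
    r.raise_for_status()  # Fail early if the download did not succeed
    # Use a context manager so that the file is flushed and closed
    with open("../data/nasa.json", "wb") as f:
        f.write(r.content)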

In [None]:
with open("../data/nasa.json", "r") as f:
    data = json.load(f)

In [None]:
print("Example dataset: ")
print("----------------")
print(json.dumps(data["dataset"][0], indent="   "))

print()
print("Keywords: ", data["dataset"][0]["keyword"])

Create a dataframe of keyword occurrences in different datasets.

In [None]:
rows = []
for el in data['dataset']:
    row = [el["identifier"]]
    if "keyword" in el:
        for k in el["keyword"]:
            rows.append(row + [k])
keyword_data = pd.DataFrame(rows, columns=["dataset", "keyword"])

In [None]:
keyword_data

Aggregate dataset ids for each keyword and select the 500 most frequently used keywords.

In [None]:
n = 500

In [None]:
aggregated_datasets = keyword_data.groupby("keyword").aggregate(set)["dataset"]
most_frequent_keywords = list(aggregated_datasets.apply(len).nlargest(n).index)
most_frequent_keywords[:5]

Create a property graph object whose nodes are unique keywords.

In [None]:
graph = PandasPGFrame()
graph.add_nodes(most_frequent_keywords)
graph.add_node_types({n: "Keyword" for n in most_frequent_keywords})

Add sets of dataset ids as properties of our keyword nodes.

In [None]:
aggregated_datasets.index.name = "@id"
graph.add_node_properties(aggregated_datasets, prop_type="category")

In [None]:
graph._nodes.sample(5)

In [None]:
n_datasets = len(keyword_data["dataset"].unique())
print("Total number of datasets: ", n_datasets)

## II. Co-occurrence graph generation

We create a co-occurrence graph using the 500 most frequent keywords: nodes are keywords, and a pair of nodes is connected with an undirected edge if the two corresponding keywords co-occur in at least one dataset. Moreover, the edges are equipped with weights corresponding to:

- raw co-occurrence frequency
- normalized pointwise mutual information (NPMI)
- frequency- and mutual-information-based distances (1 / frequency, 1 / NPMI)

In [None]:
generator = CooccurrenceGenerator(graph)
comention_edges = generator.generate_from_nodes(
    "dataset", total_factor_instances=n_datasets,
    compute_statistics=["frequency", "npmi"])
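
As a rough illustration of what the co-occurrence step computes, the sketch below derives raw co-occurrence frequencies from sets of dataset ids using only the standard library. It is not BlueGraph's implementation: the toy keywords, dataset ids and the `cooccurrence_edges` helper are assumptions made for this example.

```python
from itertools import combinations

# Toy input: keyword -> set of dataset ids (hypothetical data)
toy_keyword_datasets = {
    "mars": {"d1", "d2", "d3"},
    "rover": {"d2", "d3"},
    "ocean": {"d4"},
}


def cooccurrence_edges(keyword_datasets):
    """Return (source, target, frequency) for every co-occurring pair."""
    edges = []
    for a, b in combinations(sorted(keyword_datasets), 2):
        common = keyword_datasets[a] & keyword_datasets[b]
        if common:  # connect keywords sharing at least one dataset
            edges.append((a, b, len(common)))
    return edges


toy_edges = cooccurrence_edges(toy_keyword_datasets)
print(toy_edges)
```

Only the pair sharing datasets produces an edge; keywords with no dataset in common stay unconnected.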

Remove edges with non-positive NPMI (the filter below keeps only edges with NPMI > 0).

In [None]:
# Take an explicit copy so that adding columns below does not operate on a slice
comention_edges = comention_edges[comention_edges["npmi"] > 0].copy()

Compute the NPMI-based distance score

In [None]:
comention_edges.loc[:, "distance_npmi"] = comention_edges.loc[:, "npmi"].apply(lambda x: 1 / x)

Add generated edges to the property graph.

In [None]:
graph.remove_node_properties("dataset") # Remove datasets from node properties
graph._edges = comention_edges.drop(columns=["common_factors"])
graph._edge_prop_types = {
    "frequency": "numeric",
    "npmi": "numeric",
    "distance_npmi": "numeric"
}

In [None]:
graph.edges(raw_frame=True).sample(5)

## III. Initializing Neo4j graph from a PGFrame

In this section we will populate a Neo4j database with the generated keyword co-occurrence property graph.

In the cells below provide the credentials for connecting to your instance of the Neo4j database.

In [None]:
NEO4J_URI = "bolt://localhost:7687"
NEO4J_USER = "neo4j"

In [None]:
NEO4J_PASSWORD = getpass.getpass()

Populate the Neo4j database with the nodes and edges of the generated property graph using `pgframe_to_neo4j`. We specify labels of nodes (`Keyword`) and edges (`CoOccurs`) to use for the new elements.

In [None]:
NODE_LABEL = "Keyword"
EDGE_LABEL = "CoOccurs"

In [None]:
# (!) Running this cell multiple times may create the nodes and edges of the graph
# multiple times. If you have already run the notebook, set the parameter `pgframe` to None:
# this prevents re-populating the Neo4j database with the generated graph, but still creates
# the necessary `Neo4jGraphView` object.
graph_view = pgframe_to_neo4j(
    pgframe=graph,  # None, if no population is required
    uri=NEO4J_URI, username=NEO4J_USER, password=NEO4J_PASSWORD, 
    node_label=NODE_LABEL, edge_label=EDGE_LABEL,
    directed=False)

In [None]:
# # If you want to clear the database from created elements, run
# graph_view._clear()

## IV. Nearest neighbors by NPMI

In this section we will compute the top 10 neighbors of the keywords 'mars' and 'saturn' by the highest NPMI.

To do so, we will use the `top_neighbors` method of the `PathFinder` interface provided by BlueGraph. This interface allows us to search for the top neighbors with the highest edge weight. In this example, we use the Neo4j-based `Neo4jPathFinder` interface.

In [None]:
path_finder = Neo4jPathFinder.from_graph_object(graph_view)

In [None]:
path_finder.top_neighbors("mars", 10, weight="npmi")

In [None]:
path_finder.top_neighbors("saturn", 10, weight="npmi")

## V. Graph metrics and node centrality measures

BlueGraph provides the `MetricProcessor` interface for computing various graph statistics. We will use the Neo4j-based `Neo4jMetricProcessor` interface.

In [None]:
metrics = Neo4jMetricProcessor.from_graph_object(graph_view)

In [None]:
print("Density of the constructed network: ", metrics.density())

### Node centralities

In this example we will compute the Degree and PageRank centralities only for the raw frequency, and the Betweenness centrality for the mutual-information-based scores. We will use methods provided by the `MetricProcessor` interface in the _write_ mode, i.e. computed metrics will be written as node properties of the underlying graph object.

_Degree centrality_ is given by the sum of weights of all incident edges of the given node and characterizes the importance of the node in the network in terms of its connectivity to other nodes (high degree = high connectivity).
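
The definition above can be sketched in a few lines of plain Python. The edge list and the `weighted_degree` helper are hypothetical and only illustrate the computation; the notebook itself delegates it to Neo4j.

```python
from collections import defaultdict

# Toy undirected edges: (source, target, co-occurrence frequency)
toy_edges = [
    ("mars", "rover", 5),
    ("mars", "water", 2),
    ("rover", "water", 1),
]


def weighted_degree(edges):
    """Sum the weights of all edges incident to each node."""
    degree = defaultdict(float)
    for source, target, weight in edges:
        # An undirected edge contributes to both of its endpoints
        degree[source] += weight
        degree[target] += weight
    return dict(degree)


degrees = weighted_degree(toy_edges)
print(degrees)
```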

In [None]:
metrics.degree_centrality("frequency", write=True, write_property="degree")

_PageRank centrality_ is another measure that estimates the importance of the given node in the network. Roughly speaking, it can be interpreted as the probability that, having landed on a random node in the network, we will jump to the given node (here the edge weights are taken into account).

https://en.wikipedia.org/wiki/PageRank
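
For intuition, the following is a minimal power-iteration sketch of weighted PageRank on an undirected toy graph. The damping factor of 0.85, the fixed number of iterations and the `pagerank` helper are assumptions of this example; the Neo4j GDS implementation used by the notebook may use a different scaling and different stopping criteria.

```python
def pagerank(edges, damping=0.85, iterations=100):
    """Weighted PageRank by power iteration (illustrative sketch).

    edges: list of undirected (source, target, weight) tuples.
    Every node is assumed to have at least one incident edge.
    """
    neighbors = {}
    for u, v, w in edges:
        neighbors.setdefault(u, {})[v] = w
        neighbors.setdefault(v, {})[u] = w
    nodes = sorted(neighbors)
    # Total weight leaving each node, used to normalize jump probabilities
    strength = {n: sum(neighbors[n].values()) for n in nodes}
    rank = {n: 1 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_rank = {}
        for n in nodes:
            # Rank flowing in from each neighbor, proportional to edge weight
            incoming = sum(
                rank[m] * w / strength[m] for m, w in neighbors[n].items())
            new_rank[n] = (1 - damping) / len(nodes) + damping * incoming
        rank = new_rank
    return rank


# Star-shaped toy graph: 'a' is connected to every other node
toy_rank = pagerank([("a", "b", 1), ("a", "c", 1), ("a", "d", 1)])
print(toy_rank)
```

In this normalization the scores sum to one, and the hub node `a` receives the highest score.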

In [None]:
metrics.pagerank_centrality("frequency", write=True, write_property="pagerank")

We then compute the betweenness centrality based on the NPMI distances.

_Betweenness centrality_ is a node importance measure that estimates how often a shortest path between a pair of nodes will pass through the given node.

In [None]:
metrics.betweenness_centrality("distance_npmi", write=True, write_property="betweenness")

Now, we will export this backend-specific graph object into a `PGFrame`.

In [None]:
new_graph = metrics.get_pgframe(node_prop_types=graph._node_prop_types, edge_prop_types=graph._edge_prop_types)

In [None]:
new_graph.nodes(raw_frame=True).sample(5)

In [None]:
print("Top 10 nodes by degree")
for n in new_graph.nodes(raw_frame=True).nlargest(10, columns=["degree"]).index:
    print("\t", n)

In [None]:
print("Top 10 nodes by PageRank")
for n in new_graph.nodes(raw_frame=True).nlargest(10, columns=["pagerank"]).index:
    print("\t", n)

In [None]:
print("Top 10 nodes by betweenness")
for n in new_graph.nodes(raw_frame=True).nlargest(10, columns=["betweenness"]).index:
    print("\t", n)

## VI. Community detection

_Community detection_ methods partition the graph into clusters of densely connected nodes in such a way that nodes in the same community are more densely connected to each other than to nodes in different communities. In this section we will illustrate the use of the `CommunityDetector` interface provided by BlueGraph for community detection and for estimating its quality using the modularity, performance and coverage metrics.
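
To give a feeling for two of these quality metrics, here is a small standard-library sketch computing weighted modularity and coverage for a given partition. It follows the usual textbook definitions and is not taken from BlueGraph; the helper name, the toy graph and the partition are assumptions of this example.

```python
def modularity_and_coverage(edges, community_of):
    """Compute weighted modularity and coverage of a partition.

    edges: list of undirected (source, target, weight) tuples
    community_of: dict mapping each node to its community id
    """
    total_weight = sum(w for _, _, w in edges)
    intra_weight = {}  # weight of edges inside each community
    strength = {}      # summed weighted degree of each community
    for u, v, w in edges:
        cu, cv = community_of[u], community_of[v]
        strength[cu] = strength.get(cu, 0.0) + w
        strength[cv] = strength.get(cv, 0.0) + w
        if cu == cv:
            intra_weight[cu] = intra_weight.get(cu, 0.0) + w
    # Observed intra-community weight minus its expectation in a random graph
    modularity = sum(
        intra_weight.get(c, 0.0) / total_weight
        - (strength[c] / (2 * total_weight)) ** 2
        for c in strength)
    # Fraction of the total edge weight that falls inside communities
    coverage = sum(intra_weight.values()) / total_weight
    return modularity, coverage


# Two triangles joined by a single bridge edge
toy_edges = [
    ("a", "b", 1.0), ("b", "c", 1.0), ("a", "c", 1.0),
    ("d", "e", 1.0), ("e", "f", 1.0), ("d", "f", 1.0),
    ("c", "d", 1.0),
]
toy_partition = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
toy_modularity, toy_coverage = modularity_and_coverage(toy_edges, toy_partition)
print(toy_modularity, toy_coverage)
```

A partition that keeps each triangle together scores well on both metrics because only the bridge edge crosses community boundaries.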

First, we create a `Neo4j`-based instance and use several different community detection strategies provided by Neo4j.

In [None]:
com_detector = Neo4jCommunityDetector.from_graph_object(graph_view)

### Louvain algorithm

In [None]:
partition = com_detector.detect_communities(
    strategy="louvain", weight="npmi")

In [None]:
print("Modularity: ", com_detector.evaluate_parition(partition, metric="modularity", weight="npmi"))
print("Performance: ", com_detector.evaluate_parition(partition, metric="performance", weight="npmi"))
print("Coverage: ", com_detector.evaluate_parition(partition, metric="coverage", weight="npmi"))

### Label propagation

In [None]:
partition = com_detector.detect_communities(
    strategy="lpa", weight="npmi")

In [None]:
print("Modularity: ", com_detector.evaluate_parition(partition, metric="modularity", weight="npmi"))
print("Performance: ", com_detector.evaluate_parition(partition, metric="performance", weight="npmi"))
print("Coverage: ", com_detector.evaluate_parition(partition, metric="coverage", weight="npmi"))

### Writing community partition as node properties

In [None]:
com_detector.detect_communities(
    strategy="louvain", weight="npmi",
    write=True, write_property="louvain_community")

In [None]:
new_graph = com_detector.get_pgframe(
    node_prop_types=new_graph._node_prop_types,
    edge_prop_types=new_graph._edge_prop_types)

In [None]:
new_graph.nodes(raw_frame=True).sample(5)

## VII. Export network and the computed metrics

Save graph as JSON

In [None]:
new_graph.export_json("../data/nasa_comention.json")

Save the graph for Gephi import.

In [None]:
new_graph.export_to_gephi(
    "../data/gephi_nasa_comention", 
    node_attr_mapping = {
        "degree": "Degree",
        "pagerank": "PageRank",
        "betweenness": "Betweenness",
        "louvain_community": "Community"
    },
    edge_attr_mapping={
        "npmi": "Weight"
    })

The representation of the network saved above can be imported into Gephi for producing graph visualizations, as in the following example:

In the figures below colors represent communities detected using the raw frequency of the co-occurrence edges, node sizes are proportional to the PageRank of nodes and edge thickness to the NPMI values.

![alt text](./figures/nasa/full_network.png "NASA dataset keywords co-occurrence network")

We can zoom into some of the communities of keywords identified using the community detection method above:

Community | Zoom
- | - 
Celestial bodies <img src="./figures/nasa/celestial_body_cluster.png" alt="Drawing" style="width: 400px;"/>|<img src="./figures/nasa/celestial_body_cluster_zoom.png" alt="Drawing" style="width: 400px;"/>
Earth science <img src="./figures/nasa/earth_science.png" alt="Drawing" style="width: 400px;"/>|<img src="./figures/nasa/earch_science_zoom.png" alt="Drawing" style="width: 400px;"/>
Space programs and missions <img src="./figures/nasa/programs_missions.png" alt="Drawing" style="width: 400px;"/>|<img src="./figures/nasa/programs_missions_zoom.png" alt="Drawing" style="width: 400px;"/>

## VIII. Minimum spanning tree

A _minimum spanning tree_ of a network is a subset of edges that keeps the network connected while using as few edges as possible ($n - 1$ edges connecting $n$ nodes). Its weighted version additionally minimizes the total weight of the selected edges.
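
As an illustration of the idea, the sketch below builds a minimum spanning tree with Kruskal's algorithm and a simple union-find structure. This is one classical way to solve the problem and is not necessarily the algorithm used by the Neo4j or NetworkX backends; the helper name and the toy distances are assumptions of this example.

```python
def kruskal_minimum_spanning_tree(edges):
    """Return the edges of a minimum spanning tree (Kruskal's algorithm).

    edges: list of undirected (source, target, distance) tuples.
    """
    parent = {}

    def find(node):
        # Find the representative of the node's component (path halving)
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    tree = []
    # Consider the cheapest edges first
    for u, v, distance in sorted(edges, key=lambda e: e[2]):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:  # the edge joins two separate components
            parent[root_u] = root_v
            tree.append((u, v, distance))
    return tree


toy_edges = [
    ("a", "b", 1.0), ("b", "c", 2.0), ("a", "c", 3.0), ("c", "d", 1.5),
]
toy_tree = kruskal_minimum_spanning_tree(toy_edges)
print(toy_tree)
```

The most expensive edge of the cycle between `a`, `b` and `c` is dropped, leaving three edges that connect the four nodes.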

In the following example we compute a minimum spanning tree minimizing the NPMI-based distance weight of the network edges. We use the Neo4j-based implementation of the `PathFinder` interface.

In [None]:
path_finder.minimum_spanning_tree(distance="distance_npmi", write=True, write_edge_label="MSTEdge")

In [None]:
new_graph._nodes.index.unique()

In [None]:
nx_path_finder = NXPathFinder(new_graph, directed=False)
tree = nx_path_finder.minimum_spanning_tree(distance="distance_npmi")

In [None]:
tree_pgframe = networkx_to_pgframe(
    tree,
    node_prop_types=new_graph._node_prop_types,
    edge_prop_types=new_graph._edge_prop_types)

In [None]:
tree_pgframe._nodes

In [None]:
tree_pgframe.export_to_gephi(
    "../data/gephi_nasa_spanning_tree", 
    node_attr_mapping = {
        "degree": "Degree",
        "pagerank": "PageRank",
        "betweenness": "Betweenness",
        "louvain_community": "Community"
    },
    edge_attr_mapping={
        "npmi": "Weight"
    })

![alt text](./figures/nasa/tree.png "Minimum spanning tree")

Zoom Earth Science | Zoom Asteroids
-|-
![alt text](./figures/nasa/tree_zoom_1.png "Minimum spanning tree")|![alt text](./figures/nasa/tree_zoom_2.png "Minimum spanning tree")


## IX. Shortest path search

The _shortest path search problem_ consists in finding a sequence of edges from the source node to the target node that minimizes the cumulative weight (or distance) associated with the edges.
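
For readers who want to see the weighted version of this problem spelled out, here is a minimal sketch of Dijkstra's algorithm using the standard library. It is an illustration only, not the implementation behind `Neo4jPathFinder`; the helper name and the toy graph are assumptions of this example.

```python
import heapq


def dijkstra_shortest_path(edges, source, target):
    """Return (path, distance) for the cheapest path, or (None, inf).

    edges: list of undirected (source, target, distance) tuples
    with non-negative distances.
    """
    adjacency = {}
    for u, v, d in edges:
        adjacency.setdefault(u, []).append((v, d))
        adjacency.setdefault(v, []).append((u, d))
    best = {source: 0.0}   # best known distance to each node
    previous = {}          # predecessor on the best known path
    queue = [(0.0, source)]
    while queue:
        dist, node = heapq.heappop(queue)
        if node == target:
            break
        if dist > best.get(node, float("inf")):
            continue  # stale queue entry
        for neighbor, d in adjacency.get(node, []):
            candidate = dist + d
            if candidate < best.get(neighbor, float("inf")):
                best[neighbor] = candidate
                previous[neighbor] = node
                heapq.heappush(queue, (candidate, neighbor))
    if target not in best:
        return None, float("inf")
    # Walk back from the target to reconstruct the path
    path = [target]
    while path[-1] != source:
        path.append(previous[path[-1]])
    return path[::-1], best[target]


toy_edges = [("a", "b", 1.0), ("b", "c", 1.0), ("a", "c", 3.0)]
toy_path, toy_distance = dijkstra_shortest_path(toy_edges, "a", "c")
print(toy_path, toy_distance)
```

Note that with distances taken into account the cheapest path may use more edges than the direct connection, as in the toy graph above.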

In [None]:
path = path_finder.shortest_path("ecosystems", "oceans")
pretty_print_paths([path])

The cell above illustrates that the single shortest path from 'ecosystems' to 'oceans' consists of the direct edge between them.

Now, to explore related keywords, we would like to find a _set_ of $n$ shortest paths between them. Moreover, we would like these paths to be _indirect_ (not to include the direct edge from the source to the target). In the following examples we use mutual-information-based edge weights to perform our keyword exploration.

In the following examples we use Yen's algorithm for finding $n$ loopless shortest paths from the source to the target (https://en.wikipedia.org/wiki/Yen%27s_algorithm).
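
To clarify what "the $n$ shortest loopless paths" means, the sketch below finds them by exhaustively enumerating all simple paths and sorting them by total distance. This brute-force approach is deliberately not Yen's algorithm: it returns the same kind of result on a tiny graph but does not scale to the full keyword network. The helper names, toy keywords and distances are assumptions of this example.

```python
def simple_paths(adjacency, source, target, path=None):
    """Yield every loopless path from source to target (depth-first)."""
    path = [source] if path is None else path
    if source == target:
        yield list(path)
        return
    for neighbor in adjacency[source]:
        if neighbor not in path:  # never revisit a node
            path.append(neighbor)
            yield from simple_paths(adjacency, neighbor, target, path)
            path.pop()


def n_shortest_simple_paths(edges, source, target, n):
    """Return the n cheapest loopless paths by exhaustive enumeration."""
    adjacency = {}
    for u, v, d in edges:
        adjacency.setdefault(u, {})[v] = d
        adjacency.setdefault(v, {})[u] = d
    scored = []
    for path in simple_paths(adjacency, source, target):
        total = sum(adjacency[a][b] for a, b in zip(path, path[1:]))
        scored.append((total, path))
    scored.sort()
    return [path for _, path in scored[:n]]


# Toy distances (hypothetical values, smaller means more closely related)
toy_edges = [
    ("ecosystems", "oceans", 3.0),
    ("ecosystems", "biosphere", 1.0),
    ("biosphere", "oceans", 1.0),
    ("ecosystems", "land", 2.0),
    ("land", "oceans", 2.5),
]
toy_paths = n_shortest_simple_paths(toy_edges, "ecosystems", "oceans", n=2)
print(toy_paths)
```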

In [None]:
paths = path_finder.n_shortest_paths(
    "ecosystems", "oceans", n=10,
    distance="distance_npmi",
    strategy="yen")

In [None]:
pretty_print_paths(paths)

In [None]:
paths = path_finder.n_shortest_paths(
    "mission", "mars", n=10,
    distance="distance_npmi",
    strategy="yen")

In [None]:
pretty_print_paths(paths)

## X. Nested path search

To explore the space of co-occurring terms in depth, we can run the path search procedure presented above in a _nested fashion_. For a path visiting the entities $e_1, e_2, ..., e_n$ from the source to the target, we can further expand each of its edges into a set of shortest paths between the corresponding pair of successive entities (i.e. paths between $e_1$ and $e_2$, $e_2$ and $e_3$, etc.).

In [None]:
paths1 = path_finder.n_nested_shortest_paths(
    "ecosystems", "oceans",
    top_level_n=10, nested_n=3, depth=2, distance="distance_npmi",
    strategy="yen")

In [None]:
paths2 = path_finder.n_nested_shortest_paths(
    "mission", "mars",
    top_level_n=10, nested_n=3, depth=2, distance="distance_npmi",
    strategy="yen")

We can now build and visualize the subnetwork constructed using the nodes and the edges discovered during our nested path search.

In [None]:
summary_graph_oceans = networkx_to_pgframe(nx_path_finder.get_subgraph_from_paths(paths1))
summary_graph_mars = networkx_to_pgframe(nx_path_finder.get_subgraph_from_paths(paths2))

In [None]:
# Save the graph for Gephi import.
summary_graph_oceans.export_to_gephi(
    "../data/gephi_nasa_path_graph_oceans", 
    node_attr_mapping = {
        "degree": "Degree",
        "pagerank": "PageRank",
        "betweenness": "Betweenness",
        "louvain_community": "Community"
    },
    edge_attr_mapping={
        "npmi": "Weight"
    })
# Save the graph for Gephi import.
summary_graph_mars.export_to_gephi(
    "../data/gephi_nasa_path_graph_mars", 
    node_attr_mapping = {
        "degree": "Degree",
        "pagerank": "PageRank",
        "betweenness": "Betweenness",
        "louvain_community": "Community"
    },
    edge_attr_mapping={
        "npmi": "Weight"
    })

The resulting graphs visualized with Gephi:

Ecosystems <-> Oceans
<img src="./figures/nasa/path_graph_ocean.png" alt="NASA path graph" style="width: 800px;"/>

Mission <-> Mars
<img src="./figures/nasa/path_graph_mars.png" alt="NASA path graph" style="width: 800px;"/>