# NASA dataset keywords analysis

In this notebook we apply network-based co-occurrence analysis to the publicly available NASA data catalog (https://data.nasa.gov/browse, with the API endpoint https://data.nasa.gov/data.json). The catalog consists of the metadata for different NASA datasets.

We work with the sets of keywords attached to each dataset and build a keyword co-occurrence network describing relations between different dataset keywords. These relations are quantified using mutual-information-based scores (pointwise mutual information and its normalized version).
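
As a minimal sketch of the two scores, the snippet below computes pointwise mutual information (PMI) and its normalized version (NPMI) from raw occurrence counts, using the standard definitions $\mathrm{PMI} = \log \frac{p(x, y)}{p(x)\,p(y)}$ and $\mathrm{NPMI} = \mathrm{PMI} / (-\log p(x, y))$. The function names and the toy counts are made up for illustration; the exact estimator used inside `kganalytics` is not shown in this notebook and may differ.

```python
import math


def pmi(n_xy, n_x, n_y, n_total):
    """Pointwise mutual information of two keywords from occurrence counts.

    Assumes n_xy > 0, i.e. the two keywords co-occur at least once.
    """
    p_xy = n_xy / n_total
    p_x = n_x / n_total
    p_y = n_y / n_total
    return math.log(p_xy / (p_x * p_y))


def npmi(n_xy, n_x, n_y, n_total):
    """PMI divided by -log p(x, y), which bounds the score to [-1, 1]."""
    p_xy = n_xy / n_total
    if p_xy == 1.0:
        return 1.0  # both keywords appear in every dataset
    return pmi(n_xy, n_x, n_y, n_total) / -math.log(p_xy)


# Toy counts (hypothetical): 100 datasets in total.
always_together = npmi(n_xy=10, n_x=10, n_y=10, n_total=100)
independent = npmi(n_xy=10, n_x=20, n_y=50, n_total=100)
```

Keywords that only ever appear together score close to 1, while keywords that co-occur exactly as often as chance predicts score close to 0.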

See the related tutorial here: https://www.tidytextmining.com/nasa.html

In [1]:
import json
import pandas as pd
import networkx as nx

In [2]:
from kganalytics.network_generation import generate_cooccurrence_network

from kganalytics.metrics import (compute_degree_centrality,
                               compute_pagerank_centrality,
                               compute_betweenness_centrality,
                               detect_communities,
                               compute_all_metrics)
from kganalytics.export import (save_network,
                              save_to_gephi)
from kganalytics.paths import (minimum_spanning_tree,
                             top_n_paths,
                             top_n_tripaths,
                             single_shortest_path,
                             top_n_nested_paths,
                             graph_from_paths,
                             pretty_print_paths,
                             pretty_print_tripaths,
                             top_neighbors)
from kganalytics.utils import subgraph_by_types

## Data preparation

In [3]:
with open("data/nasa.json", "r") as f:
    data = json.load(f)

In [4]:
print("Example dataset: ")
print("----------------")
print(json.dumps(data["dataset"][0], indent="   "))

print()
print("Keywords: ", data["dataset"][0]["keyword"])

Example dataset: 
----------------
{
   "accessLevel": "public",
   "landingPage": "https://pds.nasa.gov/ds-view/pds/viewDataset.jsp?dsid=RO-E-RPCMAG-2-EAR2-RAW-V3.0",
   "bureauCode": [
      "026:00"
   ],
   "issued": "2018-06-26",
   "@type": "dcat:Dataset",
   "modified": "2020-03-04",
   "references": [
      "https://pds.nasa.gov"
   ],
   "keyword": [
      "international rosetta mission",
      "earth",
      "unknown"
   ],
   "contactPoint": {
      "@type": "vcard:Contact",
      "fn": "Thomas Morgan",
      "hasEmail": "mailto:thomas.h.morgan@nasa.gov"
   },
   "publisher": {
      "@type": "org:Organization",
      "name": "National Aeronautics and Space Administration"
   },
   "identifier": "urn:nasa:pds:context_pds3:data_set:data_set.ro-e-rpcmag-2-ear2-raw-v3.0",
   "description": "This dataset contains EDITED RAW DATA of the second Earth Flyby (EAR2). The closest approach (CA) took place on November 13, 2007 at 20:57",
   "title": "ROSETTA-ORBITER EARTH RPCMAG 2 EAR2 

We create a dataframe that records which keywords occur in which datasets:

In [5]:
rows = []
for el in data['dataset']:
    row = [el["identifier"]]
    if "keyword" in el:
        for k in el["keyword"]:
            rows.append(row + [k])
keyword_data = pd.DataFrame(rows, columns=["dataset", "keyword"])

In [6]:
keyword_data

Unnamed: 0,dataset,keyword
0,urn:nasa:pds:context_pds3:data_set:data_set.ro...,international rosetta mission
1,urn:nasa:pds:context_pds3:data_set:data_set.ro...,earth
2,urn:nasa:pds:context_pds3:data_set:data_set.ro...,unknown
3,C1973352326-GHRC_CLOUD,earth science
4,C1973352326-GHRC_CLOUD,atmosphere
...,...,...
110742,NASA-877__2,apollo
110743,NASA-877__2,catalog
110744,NASA-877__2,lunar
110745,TECHPORT_94299,active


In [7]:
keyword_occurrence = keyword_data.groupby("keyword").aggregate(set)

In [8]:
keyword_occurrence["frequency"] = keyword_occurrence.dataset.apply(len)
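
To make the shape of `keyword_occurrence` concrete, here is the same grouping applied to a tiny, made-up table (the `toy` names and values are hypothetical). Because the dataset identifiers are aggregated into a set, a keyword listed twice for the same dataset is counted only once.

```python
import pandas as pd

# Toy (dataset, keyword) pairs, made up for illustration.
# Note the repeated ("d2", "mars") row.
toy = pd.DataFrame(
    [("d1", "mars"), ("d2", "mars"), ("d2", "phobos"), ("d2", "mars")],
    columns=["dataset", "keyword"],
)

# One row per keyword; the "dataset" column holds the set of dataset ids.
toy_occurrence = toy.groupby("keyword").aggregate(set)

# Frequency is the number of distinct datasets mentioning the keyword.
toy_occurrence["frequency"] = toy_occurrence["dataset"].apply(len)
```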

In [9]:
print("20 most frequent keywords:\n")
for k in keyword_occurrence.nlargest(20, "frequency").index:
    print("\t", k)

20 most frequent keywords:

	 completed
	 earth science
	 atmosphere
	 national geospatial data asset
	 ngda
	 active
	 land surface
	 oceans
	 goddard space flight center
	 glenn research center
	 langley research center
	 spectral/engineering
	 jet propulsion laboratory
	 ames research center
	 johnson space center
	 biosphere
	 atmospheric water vapor
	 atmospheric radiation
	 marshall space flight center
	 atmospheric temperature


In [10]:
n_datasets = len(keyword_data.dataset.unique())

In [11]:
print("Total number of datasets: ", n_datasets)

Total number of datasets:  25379


We create a co-occurrence network using the 1000 most frequent keywords: nodes are keywords, and two nodes are connected with an undirected edge if the corresponding keywords co-occur in at least one dataset. Moreover, the edges are equipped with weights corresponding to:

- raw co-occurrence frequency
- normalized pointwise mutual information (NPMI)
- frequency- and mutual-information-based distances (1 / frequency, 1 / NPMI)
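
The edge construction described above can be sketched in a few lines of plain Python. This is only an illustration on made-up data, not the implementation of `generate_cooccurrence_network`; the attribute name `distance_frequency` is hypothetical, and the NPMI-based weights are left out for brevity.

```python
from itertools import combinations

# Toy keyword -> set of dataset ids mapping (made-up data).
occurrence = {
    "mars": {"d1", "d2", "d3"},
    "phobos": {"d2", "d3"},
    "oceans": {"d4"},
}

edges = {}
# Visit every unordered pair of keywords exactly once.
for a, b in combinations(sorted(occurrence), 2):
    shared = occurrence[a] & occurrence[b]
    if shared:  # connect keywords that co-occur in at least one dataset
        frequency = len(shared)
        edges[(a, b)] = {
            "frequency": frequency,
            # Frequent co-occurrence means a short distance.
            "distance_frequency": 1 / frequency,
        }
```

With 1000 keywords, such a loop visits 1000 · 999 / 2 = 499500 unordered pairs, which matches the number of pairs reported in the log of the network generation cell.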

## Network generation

In [12]:
comention_network = generate_cooccurrence_network(
    keyword_occurrence, "dataset", n_datasets,
    n_most_frequent=1000,
    parallelize=True)

Fitering data.....
Selected 1000 most frequent terms
Examining 499500 pairs of terms for co-occurrence...
Generated 55442 edges                    
Created a co-occurrence graph:
	number of nodes:  1000
	number of edges:  55442
Saving the edges...
Creating a graph object...


In [13]:
print("Density of the constructed network: ", nx.density(comention_network))

Density of the constructed network:  0.11099499499499499


## Nearest neighbors by NPMI

We can compute the top 10 neighbors of the keywords 'mars' and 'saturn', ranked by the highest NPMI.

In [14]:
top_neighbors(comention_network, "mars", 10, weight="npmi")

{'mars exploration rover': 0.7728438202168381,
 'phoenix': 0.6460357628160349,
 'mars science laboratory': 0.63469604531258,
 '2001 mars odyssey': 0.5894573604880277,
 'mars global surveyor': 0.5748151339080378,
 'mars reconnaissance orbiter': 0.5545541611495696,
 'viking': 0.5482308663606784,
 'mars pathfinder': 0.5215265293101519,
 'mars express': 0.5144439736810263,
 'phobos': 0.4919182437892427}

In [15]:
top_neighbors(comention_network, "saturn", 10, weight="npmi")

{'iapetus': 0.750756546067034,
 'tethys': 0.7500852737948767,
 'mimas': 0.7475544979182485,
 'phoebe': 0.7452819643004637,
 'rhea': 0.7447845472216457,
 'dione': 0.7420242607102145,
 'cassini-huygens': 0.7415350493617368,
 'enceladus': 0.734155283618827,
 'hyperion': 0.7340306290661932,
 'janus': 0.7138228927549277}
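
Conceptually, such a neighbor ranking only needs the edge attributes of one node. The sketch below does this with plain `networkx` on a toy graph; `top_neighbors_sketch` and the toy NPMI values are hypothetical and only mimic the behaviour seen in the outputs above.

```python
import networkx as nx


def top_neighbors_sketch(graph, node, n, weight):
    """Return the n neighbors of `node` with the largest `weight` edge attribute."""
    scores = {
        neighbor: attrs[weight]
        for neighbor, attrs in graph[node].items()
    }
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return dict(ranked[:n])


# Toy graph with made-up NPMI values.
toy = nx.Graph()
toy.add_edge("saturn", "titan", npmi=0.7)
toy.add_edge("saturn", "rings", npmi=0.5)
toy.add_edge("saturn", "earth", npmi=0.1)

result = top_neighbors_sketch(toy, "saturn", 2, weight="npmi")
```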

## Node centrality measures

We compute various centrality metrics (degree, PageRank, betweenness) to find the most important keywords across the entire data corpus.

In [16]:
_ = compute_degree_centrality(comention_network, ["frequency"], 20)

Top n nodes by frequency:
	earth science (37219)
	atmosphere (18258)
	national geospatial data asset (16742)
	ngda (16742)
	land surface (12152)
	oceans (9404)
	completed (8786)
	atmospheric water vapor (8248)
	atmospheric temperature (7480)
	atmospheric radiation (7402)
	biosphere (7057)
	spectral/engineering (6245)
	precipitation (5923)
	clouds (5394)
	atmospheric chemistry (4962)
	ocean temperature (4914)
	vegetation (4684)
	atmospheric pressure (4471)
	terrestrial hydrosphere (4244)
	cryosphere (4222)



In [17]:
_ = compute_pagerank_centrality(comention_network, ["frequency"], 20)

Top n nodes by frequency:
	earth science (0.02)
	completed (0.02)
	atmosphere (0.01)
	active (0.01)
	national geospatial data asset (0.01)
	ngda (0.01)
	pds (0.01)
	land surface (0.01)
	spice (0.00)
	oceans (0.00)
	labeling (0.00)
	space science (0.00)
	jupiter (0.00)
	calibration (0.00)
	mars (0.00)
	earth (0.00)
	atmospheric water vapor (0.00)
	biosphere (0.00)
	saturn (0.00)
	spectral/engineering (0.00)



In [18]:
_ = compute_betweenness_centrality(
    comention_network, ["distance_npmi"], 20)

Top n nodes by distance_npmi:
	10199 chariklo (0.12936249542795972)
	nix (0.0777490917771479)
	ida (0.07718941386276056)
	ceres (0.07389353281136848)
	planetary science (0.07308510915725344)
	project (0.07044118266563157)
	atlas (0.0632516283818889)
	safety (0.06299485858604095)
	management (0.059841404530783286)
	active (0.05961472494538627)
	star (0.054409118537375054)
	imagery (0.054409118537375054)
	iss (0.051807318340384476)
	operations (0.04893570925635054)
	space (0.048007927767446806)
	time (0.04567092142242443)
	working group for planetary system nomenclature (0.044734112870385416)
	astronomy (0.044442237829011376)
	2p/encke 1 (1818 w1) (0.044327895029297834)
	gsfc (0.044102218450915845)



## Community detection

We perform community detection to identify clusters of densely connected nodes (keywords representing strongly associated concepts). Such community detection is performed using two different edge weights: raw frequency and NPMI.

In [19]:
_ = detect_communities(comention_network, weight="frequency", set_attr="community")
_ = detect_communities(comention_network, weight="npmi", set_attr="community_npmi")

Detecting communities...
Best network partition:
	 Number of communities: 13
	 Modularity: 0.6383457625721062
Detecting communities...
Best network partition:
	 Number of communities: 7
	 Modularity: 0.16918879059464034


The output of the cell above suggests that, in our example, the community separation is better when using the raw frequency rather than NPMI, as the modularity is higher (about 0.64 versus 0.17).
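
To illustrate what a partition and its modularity look like, the sketch below runs a community detection routine from `networkx` on a toy graph. Note that this notebook does not state which algorithm `detect_communities` uses; greedy modularity maximization is used here only as a stand-in, and the toy weights are made up.

```python
import networkx as nx
from networkx.algorithms.community import (
    greedy_modularity_communities,
    modularity,
)

# Toy graph: two tightly connected triangles joined by one weak edge.
toy = nx.Graph()
strong_edges = [
    ("a", "b"), ("b", "c"), ("a", "c"),
    ("x", "y"), ("y", "z"), ("x", "z"),
]
for u, v in strong_edges:
    toy.add_edge(u, v, frequency=10)
toy.add_edge("c", "x", frequency=1)

# Partition the nodes, then score the partition with weighted modularity.
communities = greedy_modularity_communities(toy, weight="frequency")
score = modularity(toy, communities, weight="frequency")
```

A modularity close to 0 means the partition is no better than a random grouping, while higher values indicate that most of the edge weight falls inside communities.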

Save the graph for Gephi import.

In [20]:
save_to_gephi(
    comention_network, "data/gephi_nasa_comention", 
    node_attr_mapping = {
        "degree_frequency": "Degree",
        "pagerank_frequency": "PageRank",
        "betweenness_distance_npmi": "Betweenness",
        "community": "Community"
    },
    edge_attr_mapping={
        "npmi": "Weight"
    })

The representation of the network saved above can be imported into Gephi for producing graph visualizations, as in the following example:

In the figures below, colors represent communities detected using the raw frequency of the co-occurrence edges, node sizes are proportional to the PageRank of the nodes, and edge thickness is proportional to the NPMI values.

![alt text](./figures/nasa/full_network.png "NASA dataset keywords co-occurrence network")

We can zoom into some of the keyword communities identified using the community detection method above:

Community | Zoom
- | - 
Celestial bodies <img src="./figures/nasa/celestial_body_cluster.png" alt="Drawing" style="width: 400px;"/>|<img src="./figures/nasa/celestial_body_cluster_zoom.png" alt="Drawing" style="width: 400px;"/>
Earth science <img src="./figures/nasa/earth_science.png" alt="Drawing" style="width: 400px;"/>|<img src="./figures/nasa/earch_science_zoom.png" alt="Drawing" style="width: 400px;"/>
Space programs and missions <img src="./figures/nasa/programs_missions.png" alt="Drawing" style="width: 400px;"/>|<img src="./figures/nasa/programs_missions_zoom.png" alt="Drawing" style="width: 400px;"/>

## Minimum spanning tree

A _spanning tree_ of a connected network is a subset of its edges that keeps all $n$ nodes connected without forming cycles, and therefore contains exactly $n - 1$ edges. A _minimum spanning tree_ of a weighted network is a spanning tree whose total edge weight is as small as possible.

In the following example we compute a minimum spanning tree minimizing the NPMI-based distance weight of the network edges.

In [21]:
tree = minimum_spanning_tree(comention_network, weight="distance_npmi")
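
The same idea can be reproduced on a toy graph with the `networkx` function `nx.minimum_spanning_tree`. The graph and its `distance` values below are made up; whether the `kganalytics` helper wraps this exact function is an assumption.

```python
import networkx as nx

# Toy weighted graph with 4 nodes and 4 edges (one edge is redundant).
toy = nx.Graph()
toy.add_edge("a", "b", distance=1.0)
toy.add_edge("b", "c", distance=2.0)
toy.add_edge("a", "c", distance=5.0)
toy.add_edge("c", "d", distance=1.5)

# Keep the n - 1 edges that connect all nodes with minimal total distance.
toy_tree = nx.minimum_spanning_tree(toy, weight="distance")
total_distance = toy_tree.size(weight="distance")
```

The expensive edge between `a` and `c` is dropped, because `a` and `c` stay connected through `b` at a lower total distance.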

In [22]:
save_to_gephi(
    tree, "data/gephi_nasa_spanning_tree", 
    node_attr_mapping = {
        "degree_frequency": "Degree",
        "pagerank_frequency": "PageRank",
        "betweenness_distance_npmi": "Betweenness",
        "community": "Community"
    },
    edge_attr_mapping={
        "npmi": "Weight"
    })

![alt text](./figures/nasa/tree.png "Minimum spanning tree")

Zoom Earth Science | Zoom Asteroids
-|-
![alt text](./figures/nasa/tree_zoom_1.png "Minimum spanning tree")|![alt text](./figures/nasa/tree_zoom_2.png "Minimum spanning tree")


## Shortest path search

The _shortest path search problem_ consists in finding a sequence of edges from the source node to the target node that minimizes the cumulative weight (or distance) associated with the edges.

In [23]:
path = single_shortest_path(comention_network, "ecosystems", "oceans")
pretty_print_paths([path])

ecosystems <->  <-> oceans
               


The cell above illustrates that the single shortest path from 'ecosystems' to 'oceans' consists of the direct edge between them.
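
Whether a direct edge is the shortest path depends on whether edge weights are taken into account. The toy example below (plain `networkx`, made-up distances) contrasts the two cases; it is not meant to describe the defaults of `single_shortest_path`.

```python
import networkx as nx

# Toy graph: the direct edge is long, the detour via "biosphere" is short.
toy = nx.Graph()
toy.add_edge("ecosystems", "oceans", distance=3.0)
toy.add_edge("ecosystems", "biosphere", distance=1.0)
toy.add_edge("biosphere", "oceans", distance=1.0)

# Without weights every edge counts as one hop, so the direct edge wins.
unweighted = nx.shortest_path(toy, "ecosystems", "oceans")

# With distances the two-hop detour (1.0 + 1.0) beats the direct edge (3.0).
weighted = nx.shortest_path(toy, "ecosystems", "oceans", weight="distance")
```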

Now, to explore related keywords, we would like to find a _set_ of $n$ shortest paths between them. Moreover, we would like these paths to be _indirect_ (i.e. not to include the direct edge from the source to the target). We use mutual-information-based edge weights to perform this keyword exploration.

In the following examples we use Yen's algorithm for finding $n$ loopless shortest paths from the source to the target (https://en.wikipedia.org/wiki/Yen%27s_algorithm).

In [24]:
paths = top_n_paths(
    comention_network, "ecosystems", "oceans", n=10,
    distance="distance_npmi",
    strategy="yen")

In [25]:
pretty_print_paths(paths)

ecosystems <->                                                      <-> oceans
               geomorphic landforms/processes
               aquatic ecosystems
               biosphere <-> coastal processes
               biosphere <-> ocean waves
               biosphere <-> terrestrial ecosystems
               geomorphic landforms/processes <-> coastal processes
               bathymetry/seafloor topography
               geomorphic landforms/processes <-> ocean winds
               geomorphic landforms/processes <-> ocean waves
               aquatic ecosystems <-> ocean optics
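
A comparable search can be sketched with `nx.shortest_simple_paths`, which is documented as being based on Yen's algorithm and yields loopless paths in order of increasing total weight. The helper name `top_n_indirect_paths` and the toy distances are hypothetical; the sketch simply skips the direct edge to keep only indirect paths.

```python
from itertools import islice

import networkx as nx


def top_n_indirect_paths(graph, source, target, n, weight):
    """Return up to n lightest simple paths that skip the direct source-target edge."""
    candidates = nx.shortest_simple_paths(graph, source, target, weight=weight)
    # A path with only two nodes is the direct edge itself.
    indirect = (path for path in candidates if len(path) > 2)
    return list(islice(indirect, n))


# Toy graph with made-up distances.
toy = nx.Graph()
toy.add_edge("ecosystems", "oceans", distance=0.5)
toy.add_edge("ecosystems", "biosphere", distance=1.0)
toy.add_edge("biosphere", "oceans", distance=1.0)
toy.add_edge("ecosystems", "aquatic ecosystems", distance=0.7)
toy.add_edge("aquatic ecosystems", "oceans", distance=0.8)

toy_paths = top_n_indirect_paths(
    toy, "ecosystems", "oceans", n=2, weight="distance")
```

Because the paths are generated lazily, only as many candidates as needed are computed before the generator is abandoned.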


In [26]:
paths = top_n_paths(
    comention_network, "mission", "mars", n=10,
    distance="distance_npmi",
    strategy="yen")

In [27]:
pretty_print_paths(paths)

mission <->                                                         <-> mars
            pegasus <-> mars reconnaissance orbiter
            pegasus <-> phoenix
            delta <-> mars reconnaissance orbiter
            earth's bridge to space <-> mars reconnaissance orbiter
            vehicle <-> mars reconnaissance orbiter
            mars reconnaissance orbiter
            history <-> mars reconnaissance orbiter
            support <-> mars reconnaissance orbiter
            landing <-> mars reconnaissance orbiter
            delta <-> phoenix


## Nested path search

To explore the space of co-occurring terms in depth, we can run the path search procedure presented above in a _nested fashion_. Given a path from the source to the target that passes through the entities $e_1, e_2, ..., e_n$, we can further expand each of its edges into several shortest paths between the corresponding pair of successive entities (i.e. paths between $e_1$ and $e_2$, $e_2$ and $e_3$, etc.).

In [28]:
paths1 = top_n_nested_paths(
    comention_network, "ecosystems", "oceans",
    n=10, nested_n=3, depth=2, distance="distance_npmi",
    strategy="naive")

In [29]:
paths2 = top_n_nested_paths(
    comention_network, "mission", "mars",
    n=10, nested_n=3, depth=2, distance="distance_npmi",
    strategy="naive")
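
One level of such nesting can be sketched as follows with plain `networkx`. The helper names and the toy graph are hypothetical, and the handling of the `depth` and `strategy` parameters in `top_n_nested_paths` is not reproduced here.

```python
from itertools import islice

import networkx as nx


def k_shortest(graph, source, target, k, weight):
    """Return up to k lightest simple paths between source and target."""
    paths = nx.shortest_simple_paths(graph, source, target, weight=weight)
    return list(islice(paths, k))


def nested_paths_sketch(graph, source, target, n, nested_n, weight):
    """Top-level paths plus, for each of their edges, nested_n paths between its endpoints."""
    top_level = k_shortest(graph, source, target, n, weight)
    nested = []
    for path in top_level:
        # Expand every pair of successive entities on the path.
        for u, v in zip(path, path[1:]):
            nested.extend(k_shortest(graph, u, v, nested_n, weight))
    return top_level, nested


# Toy graph with made-up distances.
toy = nx.Graph()
toy.add_edge("ecosystems", "oceans", distance=0.5)
toy.add_edge("ecosystems", "biosphere", distance=1.0)
toy.add_edge("biosphere", "oceans", distance=1.0)
toy.add_edge("ecosystems", "aquatic ecosystems", distance=0.7)
toy.add_edge("aquatic ecosystems", "oceans", distance=0.8)

top_level, nested = nested_paths_sketch(
    toy, "ecosystems", "oceans", n=2, nested_n=2, weight="distance")
```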

We can now build and visualize the subnetwork constructed using the nodes and the edges discovered during our nested path search.

In [30]:
summary_graph1 = graph_from_paths(paths1, source_graph=comention_network)
summary_graph2 = graph_from_paths(paths2, source_graph=comention_network)
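
A subnetwork of this kind can be assembled by keeping only the edges that lie on at least one discovered path. The sketch below does so with `Graph.edge_subgraph` from `networkx`; `graph_from_paths_sketch` is a hypothetical helper, and the actual `graph_from_paths` may select nodes and edges differently.

```python
import networkx as nx


def graph_from_paths_sketch(paths, source_graph):
    """Keep only the edges of source_graph that lie on one of the given paths."""
    edges = {(u, v) for path in paths for u, v in zip(path, path[1:])}
    # edge_subgraph returns a read-only view; copy it to get an independent graph.
    return source_graph.edge_subgraph(edges).copy()


# Toy graph with made-up NPMI values.
toy = nx.Graph()
toy.add_edge("a", "b", npmi=0.9)
toy.add_edge("b", "c", npmi=0.8)
toy.add_edge("a", "c", npmi=0.1)
toy.add_edge("c", "d", npmi=0.4)

summary = graph_from_paths_sketch([["a", "b", "c"]], source_graph=toy)
```

Edge attributes such as `npmi` are carried over from the source graph, so the summary graph can be exported with the same attribute mappings as the full network.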

In [31]:
# Save the graph for Gephi import.
save_to_gephi(
    summary_graph1, "data/gephi_nasa_path_graph_oceans", 
    node_attr_mapping = {
        "degree_frequency": "Degree",
        "pagerank_frequency": "PageRank",
        "betweenness_distance_npmi": "Betweenness",
        "community_npmi": "Community"
    },
    edge_attr_mapping={
        "npmi": "Weight"
    })
# Save the graph for Gephi import.
save_to_gephi(
    summary_graph2, "data/gephi_nasa_path_graph_mission", 
    node_attr_mapping = {
        "degree_frequency": "Degree",
        "pagerank_frequency": "PageRank",
        "betweenness_distance_npmi": "Betweenness",
        "community_npmi": "Community"
    },
    edge_attr_mapping={
        "npmi": "Weight"
    })

The resulting graphs, visualized with Gephi:

Ecosystems <-> Oceans
<img src="./figures/nasa/path_graph_ocean.png" alt="NASA path graph" style="width: 800px;"/>

Mission <-> Mars
<img src="./figures/nasa/path_graph_mars.png" alt="NASA path graph" style="width: 800px;"/>