# KG-Hub Tutorial 2 - Loading and Examining Knowledge Graphs

This walkthrough provides a basic introduction to loading KG-Hub graphs for analysis. It assumes you have already set up a KG-Hub project and produced a merged graph, as in the Getting Started tutorial notebook. The graph should be in the project's `data/merged/` directory, named `merged-kg.tar.gz`, and be in KGX TSV format.

If the merged graph is somewhere else, change the value of `merged_graph_path` below. Otherwise, run the code block as is.

In [None]:
merged_graph_path = "../data/merged/merged-kg.tar.gz"

If you don't already have a graph and just want to dive in, run the next two blocks instead. The first downloads a copy of the MONDO disease ontology graph from KG-OBO, and the second points `merged_graph_path` at it. This is not the most exciting input, but it's comparatively small and will still work in the following examples.

In [None]:
!wget https://kg-hub.berkeleybop.io/kg-obo/mondo/2022-08-01/mondo_kgx_tsv.tar.gz

In [None]:
merged_graph_path = "./mondo_kgx_tsv.tar.gz"

## Loading and processing graphs with GraPE

The [Graph Processing and Embedding (GraPE) package](https://github.com/AnacletoLAB/grape) is a comprehensive toolbox for loading, processing, describing, and otherwise learning from graphs. It has two primary components: Ensmallen, which handles graph processing, and Embiggen, which produces embeddings. Working with large, complex graphs can be very computationally intensive, so the GraPE tools use a variety of strategies to optimize efficiency. They also work very well with KG-Hub graphs!

[The full documentation for GraPE is here.](https://anacletolab.github.io/grape/index.html) You'll see that it offers a sizable collection of functions, so feel free to explore. There are also [tutorial notebooks](https://github.com/AnacletoLAB/grape/tree/main/tutorials) to peruse. For now, let's get GraPE ready, load a graph, and learn about its features.

First, install GraPE and its dependencies with `pip`:

In [None]:
%pip install grape -U

Every graph in Ensmallen is loaded as a `Graph` object, so we import that class (and `random`, because we'll use it later):

In [None]:
from grape import Graph
import random

Decompress the graph, as Ensmallen will expect separate node and edge files. If your node and edge filenames differ from the values for `merged_node_filename` and `merged_edge_filename` below, please change them. 

In [None]:
!tar xvzf $merged_graph_path
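
If you are unsure what the node and edge files are called, you can inspect the archive from Python instead of reading through the `tar` output. The following is a minimal sketch using only the standard library. The helper name `find_kgx_files` is hypothetical, and it assumes the archive follows the naming seen in this tutorial, with member names ending in `_nodes.tsv` and `_edges.tsv`.

```python
import tarfile


def find_kgx_files(archive_path):
    """Return (node_file, edge_file) member names found in a KGX TSV tarball.

    Assumption: member names end in '_nodes.tsv' and '_edges.tsv'.
    Either value is None if no matching member is found.
    """
    # "r:*" lets tarfile detect the compression type on its own.
    with tarfile.open(archive_path, "r:*") as archive:
        names = archive.getnames()
    node_file = next((n for n in names if n.endswith("_nodes.tsv")), None)
    edge_file = next((n for n in names if n.endswith("_edges.tsv")), None)
    return node_file, edge_file
```

In this notebook, `find_kgx_files(merged_graph_path)` should return the two names to assign to `merged_node_filename` and `merged_edge_filename` in the next block.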

In [None]:
merged_node_filename = "merged-kg_nodes.tsv" # May need to change this to match the block above, like 'mondo_kgx_tsv_nodes.tsv'
merged_edge_filename = "merged-kg_edges.tsv" # Same here - this may be 'mondo_kgx_tsv_edges.tsv'
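
Before loading, it can help to confirm that both files contain the columns the loader will be asked for. This sketch reads only the header row with the standard `csv` module. `missing_columns` is a hypothetical helper, and the required column names are the KGX ones used in this tutorial (`id`, `category`, `subject`, `object`).

```python
import csv

# Column names this tutorial passes to the loader (KGX TSV convention).
REQUIRED_NODE_COLUMNS = {"id", "category"}
REQUIRED_EDGE_COLUMNS = {"subject", "object"}


def missing_columns(tsv_path, required):
    """Return the required column names absent from a TSV file's header row."""
    with open(tsv_path, newline="", encoding="utf-8") as handle:
        # Read just the first row; an empty file yields an empty header.
        header = next(csv.reader(handle, delimiter="\t"), [])
    return sorted(set(required) - set(header))
```

An empty list from both `missing_columns(merged_node_filename, REQUIRED_NODE_COLUMNS)` and `missing_columns(merged_edge_filename, REQUIRED_EDGE_COLUMNS)` suggests the files match the expected layout.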

Load the graph with Ensmallen's `from_csv` (don't worry, we will tell it that these are TSV files, not CSV):

In [None]:
a_big_graph = Graph.from_csv(
    node_path=merged_node_filename,
    edge_path=merged_edge_filename,
    node_list_separator="\t",
    edge_list_separator="\t",
    node_list_header=True,  # Always true for KG-Hub KGs
    edge_list_header=True,  # Always true for KG-Hub KGs
    nodes_column='id',  # Same column name in all KG-Hub KGs
    node_list_node_types_column='category',  # Same column name in all KG-Hub KGs
    sources_column='subject',  # Same column name in all KG-Hub KGs
    destinations_column='object',  # Same column name in all KG-Hub KGs
    directed=False,
    name="Big Graph of Nodes and Edges",
    verbose=True
)

a_big_graph

Great, now we've loaded a graph, and the report above gives us a comprehensive idea of its contents, at least in terms of topology.

We can retrieve the number of connected nodes, i.e., the node count excluding all disconnected nodes:

In [None]:
a_big_graph.get_number_of_connected_nodes()

We can also retrieve an array of randomly sampled nodes to work with:

In [None]:
# This will output a numpy array.
# Set random_state to a specific value to get the same result reproducibly
random_int = random.randint(10000,99999)
some_nodes = a_big_graph.get_sorted_unique_random_nodes(number_of_nodes_to_sample=10, random_state=random_int)
some_nodes
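
To see why fixing `random_state` matters, here is a small standard-library illustration of seeded sampling. It is not GraPE's sampler, and `sample_node_ids` is a hypothetical helper. It only shows that the same seed yields the same sample.

```python
import random


def sample_node_ids(number_of_nodes, sample_size, seed):
    """Return a sorted sample of unique node IDs drawn with a fixed seed."""
    # A private generator keeps the result independent of global random state.
    rng = random.Random(seed)
    return sorted(rng.sample(range(number_of_nodes), sample_size))


first = sample_node_ids(1000, 10, seed=42)
second = sample_node_ids(1000, 10, seed=42)
print(first == second)  # True: the same seed always gives the same sample
```

In the block above, replacing `random_int` with a fixed integer has the same effect on the GraPE call.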

The nodes are represented as integers for the sake of efficiency. If you'd prefer names, we can get those too:

In [None]:
all_node_names = []
for node_id in some_nodes:
    node_name = a_big_graph.get_node_name_from_node_id(node_id)
    all_node_names.append((node_id,node_name))
all_node_names

We can see how many neighbors each node has (i.e., its degree):

In [None]:
all_node_degrees = []
for node_id in some_nodes:
    node_degree = a_big_graph.get_node_degree_from_node_id(node_id)
    all_node_degrees.append((node_id,node_degree))
all_node_degrees

We may also retrieve node types, starting with the node ID numbers:

In [None]:
all_node_types = []
for node_id in some_nodes:
    one_node_type = a_big_graph.get_node_type_names_from_node_id(node_id)
    if one_node_type not in all_node_types:
        all_node_types.append(one_node_type)
all_node_types

One node may have multiple node types. In KGX TSV files these are delimited by pipe characters, so a type name here may look like several categories joined by `|`.

It's also entirely possible that there's only one node type in this set, usually `biolink:NamedThing`. That's almost the most generic class available in Biolink.
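
If you want to count individual categories rather than whole pipe-delimited strings, you can split them yourself. This is a minimal sketch, assuming category values are plain strings delimited by `|`. `count_categories` is a hypothetical helper, and the example values are made up.

```python
from collections import Counter


def count_categories(category_values):
    """Count individual categories across pipe-delimited category strings."""
    counts = Counter()
    for value in category_values:
        # Split on the pipe and skip empty pieces such as a trailing delimiter.
        counts.update(part for part in value.split("|") if part)
    return counts


# Made-up example values, not taken from a real graph.
example = [
    "biolink:Disease|biolink:NamedThing",
    "biolink:NamedThing",
    "biolink:Gene",
]
print(count_categories(example).most_common())
```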

Let's collect some more metrics about this graph. Some of these will be redundant with the report generated above, but it's convenient to collect them individually.

In [None]:
graph_stats = {
    "node_count": a_big_graph.get_number_of_nodes(),
    "edge_count": a_big_graph.get_number_of_edges(),
    "connected_components": a_big_graph.get_number_of_connected_components(),
    "density": a_big_graph.get_density(),
    "singleton_count": a_big_graph.get_number_of_singleton_nodes(),
    "max_node_degree": a_big_graph.get_maximum_node_degree(),
    "mean_node_degree": a_big_graph.get_node_degrees_mean(),
    # This last one can take some time to complete,
    # so comment it out if you're impatient.
    "betweenness_centrality": a_big_graph.get_betweenness_centrality()
}
graph_stats
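
To make these metrics concrete, the sketch below computes a few of them by hand for a tiny undirected graph. It is an illustration, not GraPE's implementation: `basic_stats` is a hypothetical helper, the toy graph is made up, and density is taken here as the number of edges divided by the number of possible node pairs, which may differ from how `get_density` counts edges.

```python
from collections import defaultdict


def basic_stats(nodes, edges):
    """Compute simple statistics for a small undirected graph.

    Assumes at least two nodes, no self-loops, and no duplicate edges.
    """
    neighbors = defaultdict(set)
    for source, destination in edges:
        neighbors[source].add(destination)
        neighbors[destination].add(source)
    degrees = {node: len(neighbors[node]) for node in nodes}
    node_count = len(nodes)
    edge_count = sum(degrees.values()) // 2  # each edge is counted at both ends
    possible_edges = node_count * (node_count - 1) / 2
    return {
        "node_count": node_count,
        "edge_count": edge_count,
        "singleton_count": sum(1 for d in degrees.values() if d == 0),
        "max_node_degree": max(degrees.values()),
        "mean_node_degree": sum(degrees.values()) / node_count,
        "density": edge_count / possible_edges,
    }


# Made-up toy graph: node "E" has no edges, so it is a singleton.
toy_nodes = ["A", "B", "C", "D", "E"]
toy_edges = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")]
print(basic_stats(toy_nodes, toy_edges))
```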

Let's see which nodes are singletons.

In [None]:
a_big_graph.get_singleton_node_names()

We may wish to trim these or other nodes and edges from the graph, whether it's because they just add noise or because they prevent downstream analyses from working properly (for example, disconnected nodes aren't compatible with many graph embedding methods). Here's how we do that:

In [None]:
before = a_big_graph.get_number_of_nodes()
a_trimmed_graph = a_big_graph.remove_disconnected_nodes()
a_trimmed_graph = a_trimmed_graph.remove_selfloops()
# The next line will remove all but the largest connected component.
a_trimmed_graph = a_trimmed_graph.remove_components(top_k_components=1)
after = a_trimmed_graph.get_number_of_nodes()
print(f"Nodes removed: {before - after}")
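
Conceptually, keeping only the largest connected component amounts to a graph traversal. The sketch below shows the idea with a breadth-first search over a made-up toy graph. `largest_component` is a hypothetical helper, and GraPE's own implementation likely works differently internally.

```python
from collections import defaultdict, deque


def largest_component(nodes, edges):
    """Return the set of nodes in the largest connected component."""
    neighbors = defaultdict(set)
    for source, destination in edges:
        if source != destination:  # ignore self-loops
            neighbors[source].add(destination)
            neighbors[destination].add(source)
    seen, best = set(), set()
    for start in nodes:
        if start in seen:
            continue
        # Breadth-first search collects everything reachable from `start`.
        component, queue = {start}, deque([start])
        while queue:
            for neighbor in neighbors[queue.popleft()]:
                if neighbor not in component:
                    component.add(neighbor)
                    queue.append(neighbor)
        seen |= component
        if len(component) > len(best):
            best = component
    return best


# Made-up toy graph: "F" only has a self-loop, "D"-"E" form a small component.
toy_nodes = ["A", "B", "C", "D", "E", "F"]
toy_edges = [("A", "B"), ("B", "C"), ("D", "E"), ("F", "F")]
kept = largest_component(toy_nodes, toy_edges)
print(f"Nodes removed: {len(toy_nodes) - len(kept)}")
```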

Is the new graph truly a subgraph of the original? We can verify that:

In [None]:
# If this is True, all nodes and edges 
# in a_trimmed_graph are also in a_big_graph.
a_big_graph.contains(a_trimmed_graph)

For more graph fun, see the next tutorials, 'Link Prediction and More Graph Machine Learning' and 'Automated Machine Learning with NEAT'.