# KG-Hub: Machine Learning on Knowledge Graphs

This walkthrough is a basic introduction to preparing KG-Hub projects for graph-based machine learning and analysis. It assumes you have already set up a KG-Hub project and produced a merged graph. The graph should be in the project's `data/merged/` directory, be named `merged-kg.tar.gz`, and be in KGX TSV format.

If the merged graph is somewhere else, change the value for `merged_graph_path` below. Otherwise, just run that code block.

In [10]:
merged_graph_path = "../data/merged/merged-kg.tar.gz"

If you don't already have a graph and just want to dive in, run this next block. It will download a copy of the MONDO disease ontology graph from KG-OBO. This is not the most exciting input, but it's small and will still work in the following examples.

In [None]:
!wget https://kg-hub.berkeleybop.io/kg-obo/mondo/2022-02-04/mondo_kgx_tsv.tar.gz
merged_graph_path = "./mondo_kgx_tsv.tar.gz"

## Loading and processing graphs with GraPE

The [Graph Processing and Embedding (GraPE) package](https://github.com/AnacletoLAB/grape) is a comprehensive toolbox for loading, processing, describing, and otherwise learning from graphs. It has two primary components: Ensmallen, which handles graph processing, and Embiggen, which produces embeddings. Working with large, complex graphs can be very computationally intensive, so the GraPE tools use a variety of strategies to optimize efficiency. They also work very well with KG-Hub graphs!

[The full documentation for GraPE is here.](https://anacletolab.github.io/grape/index.html) You'll see that it offers a sizable collection of functions, so feel free to explore. There are also [tutorial notebooks](https://github.com/AnacletoLAB/grape/tree/main/tutorials) to peruse. For now, let's get GraPE ready, load a graph, and learn about its features.

First, install GraPE with `pip`:

In [4]:
!pip install grape




Every graph in Ensmallen is loaded as a `Graph` object, so we import that class:

In [5]:
from ensmallen import Graph

Decompress the graph archive, since Ensmallen expects separate node and edge files. If the extracted filenames differ from the values of `merged_node_filename` and `merged_edge_filename` below, update those variables.

In [11]:
%%bash -s "$merged_graph_path"
tar -xvf "$1"

merged-kg_nodes.tsv
merged-kg_edges.tsv
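
If you would rather not read the filenames off the `tar` output by hand, the archive can also be inspected from Python. The following is a minimal sketch using only the standard library. The helper name `find_kgx_files` is hypothetical, and it assumes the `_nodes.tsv` / `_edges.tsv` naming pattern seen in the output above.

```python
import tarfile


def find_kgx_files(archive_path):
    """Return (node_file, edge_file) member names from a KGX TSV tarball.

    Assumes member names end in `_nodes.tsv` and `_edges.tsv`, as in the
    output above. A missing file is reported as None.
    """
    # "r:*" lets tarfile detect the compression (gzip here) automatically.
    with tarfile.open(archive_path, "r:*") as archive:
        names = archive.getnames()
    node_file = next((n for n in names if n.endswith("_nodes.tsv")), None)
    edge_file = next((n for n in names if n.endswith("_edges.tsv")), None)
    return node_file, edge_file
```

Calling `find_kgx_files(merged_graph_path)` should return the two names to use for `merged_node_filename` and `merged_edge_filename`, provided the archive follows that naming pattern.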


In [13]:
merged_node_filename = "merged-kg_nodes.tsv"
merged_edge_filename = "merged-kg_edges.tsv"
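
Before loading, it can help to confirm that the files contain the columns the loader will be asked for (`id` and `category` for nodes, `subject` and `object` for edges). This is a small sketch, not part of GraPE. The helper name `missing_columns` is hypothetical, and it only reads the header row.

```python
import csv


def missing_columns(tsv_path, required):
    """Return the required column names that are absent from the TSV header."""
    with open(tsv_path, newline="", encoding="utf-8") as handle:
        # Read only the first row; an empty file yields an empty header.
        header = next(csv.reader(handle, delimiter="\t"), [])
    return [name for name in required if name not in header]
```

For example, `missing_columns(merged_node_filename, ["id", "category"])` and `missing_columns(merged_edge_filename, ["subject", "object"])` should both return an empty list for a well-formed KG-Hub graph.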

Load the graph with Ensmallen's `from_csv` (despite the name, it handles TSV files once we specify a tab separator):

In [16]:
a_big_graph = Graph.from_csv(
    node_path=merged_node_filename,
    edge_path=merged_edge_filename,
    node_list_separator="\t",  # KGX TSV files are tab-separated
    edge_list_separator="\t",
    node_list_header=True,  # KG-Hub node files always have a header row
    edge_list_header=True,  # KG-Hub edge files always have a header row
    nodes_column="id",  # Node identifier column in all KG-Hub KGs
    node_list_node_types_column="category",  # Node type column in all KG-Hub KGs
    sources_column="subject",  # Edge source column in all KG-Hub KGs
    destinations_column="object",  # Edge destination column in all KG-Hub KGs
    directed=False,  # Load the graph as undirected
    name="A_Big_Graph",
    verbose=True,
)

a_big_graph

Great, now we've loaded a graph, and evaluating `a_big_graph` on its own displays a summary report that gives us a general idea of its contents.
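
As an independent cross-check on the node types in the graph, the `category` column of the node file can be tallied directly. The sketch below uses only the standard library and does not rely on GraPE. The helper name `count_node_categories` is hypothetical, and it counts each raw `category` string as-is, so a multi-valued category would be counted as a single combined label.

```python
import csv
from collections import Counter


def count_node_categories(node_tsv_path, column="category"):
    """Tally the values of one column (default `category`) in a node TSV."""
    counts = Counter()
    with open(node_tsv_path, newline="", encoding="utf-8") as handle:
        # QUOTE_NONE keeps stray quote characters in names from confusing the parser.
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            # Rows with an empty or missing value are grouped under "(none)".
            counts[row.get(column) or "(none)"] += 1
    return counts
```

`count_node_categories(merged_node_filename).most_common(10)` would then list the ten most frequent categories.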

We can retrieve the number of connected nodes, that is, the node count excluding all disconnected nodes:

In [18]:
a_big_graph.get_connected_nodes_number()

177716

We can also retrieve the maximum node degree, i.e., the largest number of edges attached to any single node:

In [19]:
a_big_graph.get_maximum_node_degree()

7344
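
To make these two numbers concrete, here is a toy illustration in plain Python of what "connected nodes" and "maximum node degree" mean for a small undirected graph. It is a simplified sketch, not Ensmallen's implementation: the function name `degree_summary` is hypothetical, and Ensmallen's exact handling of self-loops and parallel edges may differ.

```python
from collections import defaultdict


def degree_summary(edges, nodes):
    """Return (connected_node_count, max_degree) for a simple undirected graph.

    A node's degree here is its number of distinct neighbours; a node is
    "connected" if that degree is greater than zero.
    """
    neighbours = defaultdict(set)
    for source, destination in edges:
        # Undirected: record the edge in both directions.
        neighbours[source].add(destination)
        neighbours[destination].add(source)
    degrees = {node: len(neighbours[node]) for node in nodes}
    connected = sum(1 for degree in degrees.values() if degree > 0)
    return connected, max(degrees.values(), default=0)


toy_nodes = ["A", "B", "C", "D"]
toy_edges = [("A", "B"), ("A", "C")]
print(degree_summary(toy_edges, toy_nodes))  # prints (3, 2)
```

Node `D` has no edges, so it is left out of the connected count, and `A` has the highest degree with two neighbours.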

## Embeddings and basic ML approaches with NEAT

* Intro to NEAT
* Making a NEAT config