# KG-Hub: Machine Learning on Knowledge Graphs

This walkthrough provides a basic introduction to preparing KG-Hub projects for graph-based machine learning and analysis. It assumes you have already set up a KG-Hub project and have produced a merged graph. The graph should be in the project's `data/merged/` directory, named `merged-kg.tar.gz`, and in KGX TSV format.

If the merged graph is somewhere else, change the value for `merged_graph_path` below. Otherwise, just run that code block.

In [3]:
merged_graph_path = "../data/merged/merged-kg.tar.gz"

If you don't already have a graph and just want to dive in, run the next two blocks. The first downloads a copy of the MONDO disease ontology graph from KG-OBO, and the second points `merged_graph_path` at it. This is not the most exciting input, but it's comparatively small and will still work in the following examples.

In [4]:
!wget https://kg-hub.berkeleybop.io/kg-obo/mondo/2022-02-04/mondo_kgx_tsv.tar.gz

--2022-06-28 15:49:31--  https://kg-hub.berkeleybop.io/kg-obo/mondo/2022-02-04/mondo_kgx_tsv.tar.gz
Resolving kg-hub.berkeleybop.io (kg-hub.berkeleybop.io)... 52.85.151.29, 52.85.151.90, 52.85.151.118, ...
Connecting to kg-hub.berkeleybop.io (kg-hub.berkeleybop.io)|52.85.151.29|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 8420745 (8.0M) [plain/text]
Saving to: ‘mondo_kgx_tsv.tar.gz’


2022-06-28 15:49:32 (17.7 MB/s) - ‘mondo_kgx_tsv.tar.gz’ saved [8420745/8420745]



In [5]:
merged_graph_path = "./mondo_kgx_tsv.tar.gz"

## Loading and processing graphs with GraPE

The [Graph Processing and Embedding (GraPE) package](https://github.com/AnacletoLAB/grape) is a comprehensive toolbox for loading, processing, describing, and otherwise learning from graphs. It has two primary components: Ensmallen, which handles graph processing, and Embiggen, which produces embeddings. Working with large, complex graphs can be very computationally intensive, so the GraPE tools use a variety of strategies to optimize efficiency. They also work very well with KG-Hub graphs!

[The full documentation for GraPE is here.](https://anacletolab.github.io/grape/index.html) You'll see that it offers a sizable collection of functions, so feel free to explore. There are also [tutorial notebooks](https://github.com/AnacletoLAB/grape/tree/main/tutorials) to peruse. For now, let's get GraPE ready, load a graph, and learn about its features.

First, install GraPE and its dependencies with `pip`:

In [6]:
%pip install grape -U





Every graph in Ensmallen is loaded as a `Graph` object, so we import that class (and `random`, because we'll use it later):

In [7]:
from grape import Graph
import random

Decompress the graph, as Ensmallen expects separate node and edge files. If your node and edge filenames differ from the values for `merged_node_filename` and `merged_edge_filename` below (which is likely if you downloaded the MONDO graph above), change them to match the names printed by `tar`.

In [None]:
!tar xvzf $merged_graph_path
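
If the `tar` shell command is unavailable (for example, on Windows), Python's standard-library `tarfile` module can do the same job. The sketch below is an alternative to the shell command above, not part of the original workflow. The helper name `extract_kgx_archive` is hypothetical, and it assumes the archive holds one file ending in `_nodes.tsv` and one ending in `_edges.tsv`.

```python
import os
import tarfile


def extract_kgx_archive(archive_path, dest_dir="."):
    """Extract a KGX TSV archive and return (node_file, edge_file) paths.

    Assumes member names end in `_nodes.tsv` and `_edges.tsv`;
    a file that cannot be found is reported as None.
    """
    with tarfile.open(archive_path, "r:gz") as archive:
        names = archive.getnames()
        # Only extract archives you trust: members are written as named.
        archive.extractall(dest_dir)
    node_file = next((n for n in names if n.endswith("_nodes.tsv")), None)
    edge_file = next((n for n in names if n.endswith("_edges.tsv")), None)
    return (
        os.path.join(dest_dir, node_file) if node_file else None,
        os.path.join(dest_dir, edge_file) if edge_file else None,
    )
```

Calling `extract_kgx_archive(merged_graph_path)` would return the two paths, which you could then assign to `merged_node_filename` and `merged_edge_filename`.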

In [9]:
merged_node_filename = "merged-kg_nodes.tsv"
merged_edge_filename = "merged-kg_edges.tsv"
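
Before loading, it can help to confirm that the files contain the columns the loader will be told to use. The following sketch reads only the header row of a TSV file. The helper name `missing_columns` is hypothetical; the column names to check (`id`, `category`, `subject`, `object`, `predicate`) are the ones this walkthrough relies on.

```python
import csv


def missing_columns(tsv_path, required):
    """Return the required column names that are absent from the TSV header row."""
    with open(tsv_path, newline="", encoding="utf-8") as handle:
        # An empty file yields an empty header, so every column is reported missing.
        header = next(csv.reader(handle, delimiter="\t"), [])
    return [column for column in required if column not in header]
```

For a well-formed KGX file, `missing_columns(merged_node_filename, ["id", "category"])` and `missing_columns(merged_edge_filename, ["subject", "predicate", "object"])` should both return an empty list.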

Load the graph with Ensmallen's `from_csv` (don't worry, we will tell it that these are TSV files, not CSV):

In [11]:
a_big_graph = Graph.from_csv(
    node_path=merged_node_filename,
    edge_path=merged_edge_filename,
    node_list_separator="\t",
    edge_list_separator="\t",
    node_list_header=True,  # KG-Hub KGX files always have a header row
    edge_list_header=True,
    nodes_column='id',  # Standard KGX node columns
    node_list_node_types_column='category',
    sources_column='subject',  # Standard KGX edge columns
    destinations_column='object',
    directed=False,
    name="A_Big_Graph",
    verbose=True
)

a_big_graph

Great, now we've loaded a graph. Displaying the graph object prints a summary, so we also have a general idea of its contents.

We can retrieve the total count of connected nodes (i.e., exclude all disconnected nodes from the count):

In [13]:
a_big_graph.get_number_of_connected_nodes()

232544

We can also retrieve a random sample of nodes to work with:

In [22]:
# This will output a numpy array.
# Set random_state to a specific value to get the same result reproducibly
random_int = random.randint(10000,99999)
some_nodes = a_big_graph.get_sorted_unique_random_nodes(number_of_nodes_to_sample=10, random_state=random_int)
some_nodes

array([ 15841,  27226,  61210,  88689, 114986, 138832, 157016, 163569,
       209045, 222410], dtype=uint32)
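
To illustrate what a fixed `random_state` buys you, here is a standard-library analogy. It does not call GraPE: `sample_sorted_unique` is a hypothetical stand-in that draws distinct integers from a range and sorts them, so the same seed always yields the same sample.

```python
import random


def sample_sorted_unique(number_of_nodes, sample_size, seed):
    """Draw `sample_size` distinct node IDs from range(number_of_nodes), sorted."""
    rng = random.Random(seed)  # private generator, so global random state is untouched
    return sorted(rng.sample(range(number_of_nodes), sample_size))


first = sample_sorted_unique(1000, 10, seed=42)
second = sample_sorted_unique(1000, 10, seed=42)
print(first == second)  # True: same seed, same sample
```

In the cell above, replacing `random_int` with a fixed integer would play the same role as `seed` does here.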

The nodes are represented as integers for the sake of efficiency. If you'd prefer names, we can get those too:

In [23]:
all_node_names = []
for node_id in some_nodes:
    node_name = a_big_graph.get_node_name_from_node_id(node_id)
    all_node_names.append((node_id,node_name))
all_node_names

[(15841, 'CHEBI:174346'),
 (27226, 'FOODON:03307069'),
 (61210, 'CHEBI:57268'),
 (88689, 'CHEBI:77910'),
 (114986, 'FOODON:03302184'),
 (138832, 'REACT:R-RNO-416993'),
 (157016, 'CHEBI:70601'),
 (163569, 'CHEBI:111383'),
 (209045, 'CHEBI:178705'),
 (222410, 'CHEBI:184825')]

We can see how many neighbors each node has (i.e., its degree):

In [24]:
all_node_degrees = []
for node_id in some_nodes:
    node_degree = a_big_graph.get_node_degree_from_node_id(node_id)
    all_node_degrees.append((node_id,node_degree))
all_node_degrees

[(15841, 2),
 (27226, 2),
 (61210, 3),
 (88689, 4),
 (114986, 2),
 (138832, 7),
 (157016, 3),
 (163569, 3),
 (209045, 2),
 (222410, 2)]
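
For intuition, node degree in an undirected graph can be computed directly from an edge list. This toy sketch uses made-up node names and plain Python rather than Ensmallen, and it adopts one possible convention for self-loops (counting them once).

```python
from collections import Counter


def node_degrees(edges):
    """Count how many edges touch each node in an undirected edge list."""
    degrees = Counter()
    for source, destination in edges:
        degrees[source] += 1
        if destination != source:  # assumption: a self-loop counts once
            degrees[destination] += 1
    return degrees


toy_edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]
print(node_degrees(toy_edges))
```

Here node `C` touches three edges and node `D` only one, which mirrors the kind of per-node counts shown above.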

We can also retrieve the node types for each node ID, keeping only the distinct combinations:

In [25]:
all_node_types = []
for node_id in some_nodes:
    one_node_type = a_big_graph.get_node_type_names_from_node_id(node_id)
    if one_node_type not in all_node_types:
        all_node_types.append(one_node_type)
all_node_types

[['biolink:NamedThing'],
 ['biolink:BiologicalProcessOrActivity|biolink:BiologicalProcess|biolink:NamedThing|biolink:Pathway']]

One node may have multiple node types, delimited by pipe characters.
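
In the output above, each pipe-delimited string appears as a single list element. If you need the individual Biolink categories, you can split the string yourself. A minimal sketch (the helper name `split_node_types` is hypothetical):

```python
def split_node_types(category, separator="|"):
    """Split a pipe-delimited category string into individual type names."""
    return [name for name in category.split(separator) if name]


category = (
    "biolink:BiologicalProcessOrActivity|biolink:BiologicalProcess"
    "|biolink:NamedThing|biolink:Pathway"
)
print(split_node_types(category))
```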

Finally, let's complete a task in preparation for the next section: assembling holdout data and sets of negative edges.

In [27]:
# Generate and save an 80/20 training/validation split of the edges in the input graph.
train_edge_path = merged_edge_filename + ".train"
valid_edge_path = merged_edge_filename + ".valid"

train_edge_graph, valid_edge_graph = a_big_graph.random_holdout(train_size=0.8)
train_edge_graph.dump_edges(train_edge_path, edge_type_column='predicate')
valid_edge_graph.dump_edges(valid_edge_path, edge_type_column='predicate')
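
Conceptually, a random holdout shuffles the edges and cuts the list at the requested fraction. The sketch below shows that idea on a toy edge list in plain Python. It is an illustration only and makes no claim about how Ensmallen implements `random_holdout`; the helper name `split_edges` and the fixed seed are assumptions.

```python
import random


def split_edges(edges, train_size=0.8, seed=42):
    """Shuffle an edge list and split it into training and validation parts."""
    shuffled = list(edges)  # copy, so the caller's list is left unchanged
    random.Random(seed).shuffle(shuffled)
    cutoff = int(len(shuffled) * train_size)
    return shuffled[:cutoff], shuffled[cutoff:]


toy_edges = [(i, i + 1) for i in range(10)]
train_edges, valid_edges = split_edges(toy_edges)
print(len(train_edges), len(valid_edges))  # 8 2
```

Every edge lands in exactly one of the two parts, so the validation edges are never seen during training.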

In [31]:
# Now the graph of negatives.
# Sample just as many negative examples as positive examples.
negative_graph = a_big_graph.sample_negative_graph(a_big_graph.get_number_of_edges())
negative_graph = negative_graph.remove_disconnected_nodes()
negative_graph
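
Negative edges are pairs of nodes that are not connected in the original graph; a link prediction classifier needs them as counterexamples. The sketch below shows the idea with rejection sampling on a toy graph. It is an illustration only and does not describe how Ensmallen implements `sample_negative_graph`; the helper name and toy data are made up.

```python
import random


def sample_negative_edges(nodes, positive_edges, number_of_samples, seed=42):
    """Sample distinct undirected node pairs that are absent from the positive edges."""
    nodes = list(nodes)
    rng = random.Random(seed)
    # Store positives as unordered pairs; ignore self-loops.
    known = {frozenset(edge) for edge in positive_edges if edge[0] != edge[1]}
    # Never ask for more negatives than the graph can provide.
    max_possible = len(nodes) * (len(nodes) - 1) // 2 - len(known)
    target = min(number_of_samples, max_possible)
    negatives = set()
    while len(negatives) < target:
        pair = frozenset(rng.sample(nodes, 2))
        if pair not in known:
            negatives.add(pair)
    return sorted(tuple(sorted(pair)) for pair in negatives)


toy_nodes = ["A", "B", "C", "D"]
toy_positive_edges = [("A", "B"), ("B", "C")]
print(sample_negative_edges(toy_nodes, toy_positive_edges, number_of_samples=2))
```

Rejection sampling like this is only practical when the graph is sparse, which is the usual case for knowledge graphs.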

In [33]:
# As above, this will save training and validation edge lists.
neg_train_edge_path = merged_edge_filename + ".neg_train"
neg_valid_edge_path = merged_edge_filename + ".neg_valid"

neg_train_edge_graph, neg_valid_edge_graph = negative_graph.random_holdout(train_size=0.8)
neg_train_edge_graph.dump_edges(neg_train_edge_path, edge_type_column='predicate')
neg_valid_edge_graph.dump_edges(neg_valid_edge_path, edge_type_column='predicate')

## Generating embeddings and building classifiers with NEAT

The [NEAT](https://github.com/Knowledge-Graph-Hub/neat-ml) package provides a way to define graph machine learning tasks with a single configuration file. We'll generate such a file here, then run NEAT to produce embeddings and a link prediction classifier.

We'll start by installing NEAT and scikit-learn, then define some basic parameters, largely based on what we did in the previous section.

In [56]:
%pip install neat-ml -U
%pip install scikit-learn

Collecting argparse>=1.4.0
  Using cached argparse-1.4.0-py2.py3-none-any.whl (23 kB)
Installing collected packages: argparse
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
kg-obo 0.1 requires kgx==1.5.1, but you have kgx 1.5.7 which is incompatible.
Successfully installed argparse-1.4.0
Note: you may need to restart the kernel to use updated packages.


In [41]:
directed = False # Yes, this is technically a directed network, but we'll treat it as undirected
node_path = merged_node_filename # Positive training nodes
edge_path = train_edge_path # Positive training edges
# valid_edge_path, neg_train_edge_path, and neg_valid_edge_path
# were already defined above.

# Embedding parameters
embedding_file_name = "embeddings.tsv"
embedding_history_file_name = "embedding_history.json"
node_embedding_method_name = "SPINE"

# Classifier parameters - NEAT can build multiple classifier types in one run, if specified in the configuration file
edge_method = "Average" # one of EdgeTransformer.methods: Hadamard, Sum, Average, L1, AbsoluteL1, L2, or alternatively a lambda
classifier_type = "Logistic Regression"
classifier_model_outfile = "model_lr.model"
classifier_model_type = "sklearn.linear_model.LogisticRegression"
classifier_model_random_state = 42
classifier_model_max_iter = 1000

# Output parameters
output_directory = "./"
config_filename = "scallops.yaml"
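
The `edge_method` setting controls how the embeddings of an edge's two nodes are combined into a single feature vector for the classifier. As a rough sketch of three of the listed methods, using plain Python lists rather than Embiggen's `EdgeTransformer` (whose implementation may differ in detail), with a hypothetical helper name:

```python
def edge_embedding(source_vector, destination_vector, method="Average"):
    """Combine two node embeddings into one edge embedding, element by element."""
    operations = {
        "Hadamard": lambda a, b: a * b,       # element-wise product
        "Sum": lambda a, b: a + b,            # element-wise sum
        "Average": lambda a, b: (a + b) / 2,  # element-wise mean
    }
    combine = operations[method]
    return [combine(a, b) for a, b in zip(source_vector, destination_vector)]


print(edge_embedding([1.0, 2.0], [3.0, 6.0]))  # [2.0, 4.0]
```

All three of these operations are symmetric in their inputs, which suits a graph that is treated as undirected.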

In [52]:
outstring = f"""
---
Target:
  target_path: {output_directory}

GraphDataConfiguration:
  graph:
    directed: {directed}
    node_path: {node_path}
    edge_path: {edge_path}
    verbose: True
    nodes_column: "id"
    node_list_node_types_column: "category"
    default_node_type: "biolink:NamedThing"
    sources_column: "subject"
    destinations_column: "object"
    default_edge_type: "biolink:related_to"
  evaluation_data:
    valid_data:
      pos_edge_filepath: {valid_edge_path}
      neg_edge_filepath: {neg_valid_edge_path}
    train_data:
      neg_edge_filepath: {neg_train_edge_path}

EmbeddingsConfig:
  filename: {embedding_file_name}
  history_filename: {embedding_history_file_name}
  node_embeddings_params:
    method_name: {node_embedding_method_name}
  tsne_filename: tsne.png

ClassifierContainer:
  classifiers:
    - classifier_id: lr_1
      classifier_name: {classifier_type}
      classifier_type: {classifier_model_type}
      edge_method: {edge_method}
      outfile: {classifier_model_outfile}
      parameters:
        sklearn_params:
          random_state: {classifier_model_random_state}
          max_iter: {classifier_model_max_iter}

ApplyTrainedModelsContainer:
  models:
    - model_id: lr_1
      node_types:
        source:
          - "biolink:NamedThing"
        destination:
          - "biolink:NamedThing"
      cutoff: 0.9
      outfile: lr_protein_predictions.tsv
"""
print(outstring)
with open(config_filename, "w") as outfile:
    outfile.write(outstring)


---
Target:
  target_path: ./

GraphDataConfiguration:
  graph:
    directed: False
    node_path: merged-kg_nodes.tsv
    edge_path: merged-kg_edges.tsv.train
    verbose: True
    nodes_column: "id"
    node_list_node_types_column: "category"
    default_node_type: "biolink:NamedThing"
    sources_column: "subject"
    destinations_column: "object"
    default_edge_type: "biolink:related_to"
  evaluation_data:
    valid_data:
      pos_edge_filepath: merged-kg_edges.tsv.valid
      neg_edge_filepath: merged-kg_edges.tsv.neg_valid
    train_data:
      neg_edge_filepath: merged-kg_edges.tsv.neg_train

EmbeddingsConfig:
  filename: embeddings.tsv
  history_filename: embedding_history.json
  node_embeddings_params:
    method_name: SPINE
  tsne_filename: tsne.png

ClassifierContainer:
  classifiers:
    - classifier_id: lr_1
      classifier_name: Logistic Regression
      classifier_type: sklearn.linear_model.LogisticRegression
      edge_method: Average
      outfile: model_lr.model
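
Before launching a run, it may be worth checking that every input file named in the configuration actually exists, since a missing path is an easy mistake to make. A minimal sketch with a hypothetical helper name:

```python
import os


def find_missing_files(paths):
    """Return the entries of `paths` that are not existing files on disk."""
    return [path for path in paths if not os.path.isfile(path)]
```

With the variables defined earlier, `find_missing_files([node_path, edge_path, valid_edge_path, neg_train_edge_path, neg_valid_edge_path])` should return an empty list.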


In [55]:
!neat run --config $config_filename

/home/harry/kg-env/bin/neat
Traceback (most recent call last):
  File "/home/harry/kg-env/bin/neat", line 8, in <module>
    sys.exit(cli())
  File "/home/harry/kg-env/lib/python3.8/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/harry/kg-env/lib/python3.8/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/harry/kg-env/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
    

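If the run succeeds, NEAT should save a t-SNE plot of the embeddings under the `tsne_filename` given in the configuration. Display it with:

In [None]: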
from IPython.display import Image
Image(filename='tsne.png')