# KG-Hub Tutorial 4 - Automated Machine Learning with NEAT

This walkthrough describes how to automate graph machine learning with the NEAT package. It assumes you have already set up a KG-Hub project and produced a merged graph, as in the Getting Started tutorial notebook. The graph should be in the `/data/merged/` directory, named `merged-kg.tar.gz`, and in KGX TSV format.

If the merged graph is somewhere else, change the value for `merged_graph_path` below. Otherwise, just run that code block.

In [None]:
merged_graph_path = "../data/merged/merged-kg.tar.gz"

If you don't already have a graph and just want to dive in, run this next block. It will download a copy of the MONDO disease ontology graph from KG-OBO. This is not the most exciting input, but it's comparatively small and will still work in the following examples.

In [None]:
!wget https://kg-hub.berkeleybop.io/kg-obo/mondo/2022-08-01/mondo_kgx_tsv.tar.gz

In [None]:
merged_graph_path = "./mondo_kgx_tsv.tar.gz"

## Loading the graph

First, install GraPE and a variety of other dependencies with `pip`:

In [None]:
%pip install grape -U

In [None]:
from grape import Graph

Decompress the graph, since Ensmallen (the graph-handling library inside GraPE) expects separate node and edge files. If your node and edge filenames differ from the values for `merged_node_filename` and `merged_edge_filename` below, please change them.

In [None]:
!tar xvzf $merged_graph_path

In [None]:
merged_node_filename = "merged-kg_nodes.tsv" # May need to change this to match the block above, like 'mondo_kgx_tsv_nodes.tsv'
merged_edge_filename = "merged-kg_edges.tsv" # Same here - this may be 'mondo_kgx_tsv_edges.tsv'
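If you are unsure which filenames the archive contains, they can be read from the archive itself rather than typed by hand. The following is a minimal sketch, assuming the node and edge files end in `_nodes.tsv` and `_edges.tsv` as in the two examples above. The helper name `pick_kgx_files` is hypothetical and not part of GraPE or NEAT.

```python
import os
import tarfile


def pick_kgx_files(member_names):
    """Return (node_file, edge_file), chosen by the '_nodes.tsv' / '_edges.tsv' suffixes.

    Hypothetical helper: it assumes the archive holds exactly one file of each kind.
    """
    nodes = [name for name in member_names if name.endswith("_nodes.tsv")]
    edges = [name for name in member_names if name.endswith("_edges.tsv")]
    if len(nodes) != 1 or len(edges) != 1:
        raise ValueError(f"Expected one node and one edge file, found {nodes} and {edges}")
    return nodes[0], edges[0]


# Example archive name from this tutorial; use your own merged_graph_path instead if needed.
archive = "./mondo_kgx_tsv.tar.gz"
if os.path.exists(archive):
    with tarfile.open(archive, "r:gz") as tar:
        node_file, edge_file = pick_kgx_files(tar.getnames())
    print(node_file, edge_file)
```

The two returned names can then be assigned to `merged_node_filename` and `merged_edge_filename`.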

Load the graph with Ensmallen's `from_csv`:

In [None]:
a_graph = Graph.from_csv(
    node_path=merged_node_filename,
    edge_path=merged_edge_filename,
    node_list_separator="\t",
    edge_list_separator="\t",
    node_list_header=True,  # KG-Hub node files always have a header row
    edge_list_header=True,  # KG-Hub edge files always have a header row
    nodes_column='id',  # Standard KGX column for node identifiers
    node_list_node_types_column='category',  # Standard KGX column for node types
    sources_column='subject',  # Standard KGX column for edge sources
    destinations_column='object',  # Standard KGX column for edge targets
    directed=False,
    name="Apple Fritters",
    verbose=True
)

a_graph
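If loading fails because a column is missing, it can help to inspect the TSV headers directly. The sketch below checks a header row for the column names used in this tutorial (`id`, `category`, `subject`, `object`, `predicate`). The function `missing_columns` is a hypothetical helper, and the demo row is made-up data that only illustrates the call.

```python
import csv
import io

# Column names this tutorial relies on when loading and dumping graphs.
REQUIRED_NODE_COLUMNS = {"id", "category"}
REQUIRED_EDGE_COLUMNS = {"subject", "object", "predicate"}


def missing_columns(tsv_handle, required):
    """Return the required column names absent from the first (header) row."""
    header = next(csv.reader(tsv_handle, delimiter="\t"), [])
    return sorted(required - set(header))


# Made-up in-memory node file, used here instead of a real KGX file.
demo_nodes = io.StringIO("id\tcategory\tname\nEX:1\tbiolink:NamedThing\texample\n")
print(missing_columns(demo_nodes, REQUIRED_NODE_COLUMNS))  # prints []
```

For a real file, pass an open handle instead, for example `open(merged_node_filename, newline="")`.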

We can prepare some training and validation graphs now.

In [None]:
# Generate and save an 80/20 training/validation split of the edges in the input graph.
train_edge_path = merged_edge_filename + ".train"
valid_edge_path = merged_edge_filename + ".valid"

train_edge_graph, valid_edge_graph = a_graph.random_holdout(train_size=0.8)
train_edge_graph.dump_edges(train_edge_path, edge_type_column='predicate')
valid_edge_graph.dump_edges(valid_edge_path, edge_type_column='predicate')
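To make the idea of an 80/20 edge holdout concrete, here is a plain-Python sketch on a toy edge list. It is not how GraPE implements `random_holdout`, which works on the `Graph` object itself. The function `split_edges` and the toy edges are hypothetical and exist only for illustration.

```python
import random


def split_edges(edges, train_size=0.8, seed=42):
    """Shuffle the edges and cut them into training and validation lists."""
    shuffled = list(edges)
    random.Random(seed).shuffle(shuffled)  # seeded, so the split is reproducible
    cut = int(round(len(shuffled) * train_size))
    return shuffled[:cut], shuffled[cut:]


# Toy chain graph with ten edges: n0-n1, n1-n2, ...
toy_edges = [(f"n{i}", f"n{i + 1}") for i in range(10)]
train, valid = split_edges(toy_edges)
print(len(train), len(valid))  # prints: 8 2
```

Every edge lands in exactly one of the two lists, so no validation edge is seen during training.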

In [None]:
# Now the graph of negatives: sample as many negative edges as there are positive edges.
negative_graph = a_graph.sample_negative_graph(a_graph.get_number_of_edges())
negative_graph = negative_graph.remove_disconnected_nodes()
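A negative graph is made of node pairs that are not connected in the input graph. The sketch below shows that idea on a toy undirected graph in plain Python. It is only an illustration of the concept, not GraPE's sampling procedure, and `sample_negative_edges` is a hypothetical helper.

```python
import random


def sample_negative_edges(nodes, positive_edges, n_samples, seed=42):
    """Sample distinct node pairs that do not appear among the positive edges."""
    rng = random.Random(seed)
    # Undirected graph: (a, b) and (b, a) count as the same edge.
    known = {frozenset(edge) for edge in positive_edges}
    negatives = set()
    attempts = 0
    max_attempts = 100 * n_samples  # guard against looping forever on dense graphs
    while len(negatives) < n_samples and attempts < max_attempts:
        attempts += 1
        pair = frozenset(rng.sample(nodes, 2))  # two distinct nodes
        if pair not in known:
            negatives.add(pair)
    return sorted(tuple(sorted(pair)) for pair in negatives)


# Toy chain graph: six nodes, five positive edges.
toy_nodes = [f"n{i}" for i in range(6)]
toy_positives = [(f"n{i}", f"n{i + 1}") for i in range(5)]
toy_negatives = sample_negative_edges(toy_nodes, toy_positives, len(toy_positives))
print(toy_negatives)
```

Requesting as many negatives as positives, as in the cell above, gives the classifier a balanced set of examples.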

In [None]:
# As above, this will save training and validation edge lists.
neg_train_edge_path = merged_edge_filename + ".neg_train"
neg_valid_edge_path = merged_edge_filename + ".neg_valid"

neg_train_edge_graph, neg_valid_edge_graph = negative_graph.random_holdout(train_size=0.8)
neg_train_edge_graph.dump_edges(neg_train_edge_path, edge_type_column='predicate')
neg_valid_edge_graph.dump_edges(neg_valid_edge_path, edge_type_column='predicate')

## Generating embeddings and building classifiers with NEAT

The [NEAT](https://github.com/Knowledge-Graph-Hub/neat-ml) package provides a way to define graph machine learning tasks with a single configuration file. We'll generate such a file here, then run NEAT to produce embeddings and a link prediction classifier.

We'll start by defining some basic parameters, largely based on what we did in the previous section.

In [None]:
%pip install neat-ml -U
%pip install scikit-learn

In [None]:
directed = False # Yes, this is technically a directed network, but we'll treat it as undirected
node_path = merged_node_filename # Positive training nodes
edge_path = train_edge_path # Positive training edges
#valid_edge_path - we've already defined this above
#neg_train_edge_path - we've already defined this above
#neg_valid_edge_path - we've already defined this above

# Embedding parameters
embedding_file_name = "embeddings.tsv"
embedding_history_file_name = "embedding_history.json"
node_embedding_method_name = "SPINE"

# Classifier parameters - NEAT can build multiple classifier types in one run, if specified in the configuration file
edge_method = "Average" # one of EdgeTransformer.methods: Hadamard, Sum, Average, L1, AbsoluteL1, L2, or alternatively a lambda
classifier_type = "Logistic Regression"
classifier_model_outfile = "model_lr.model"
classifier_model_type = "sklearn.linear_model.LogisticRegression"
classifier_model_random_state = 42
classifier_model_max_iter = 1000

# Output parameters
output_directory = "./"
config_filename = "scallops.yaml"
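The `edge_method` setting controls how the embeddings of an edge's two nodes are combined into a single feature vector for the classifier. The sketch below shows the conventional element-wise definitions for five of the listed methods, using plain Python lists. These definitions are an assumption for illustration; the `EdgeTransformer` in the library version you have installed is the authority. L2 is left out because its exact definition is not stated in this tutorial.

```python
# Element-wise ways to combine two node embeddings u and v into one edge embedding.
# Conventional definitions, assumed here for illustration only.
EDGE_METHODS = {
    "Hadamard": lambda u, v: [a * b for a, b in zip(u, v)],
    "Sum": lambda u, v: [a + b for a, b in zip(u, v)],
    "Average": lambda u, v: [(a + b) / 2 for a, b in zip(u, v)],
    "L1": lambda u, v: [a - b for a, b in zip(u, v)],
    "AbsoluteL1": lambda u, v: [abs(a - b) for a, b in zip(u, v)],
}

# Two made-up, two-dimensional node embeddings.
source_embedding = [1.0, 3.0]
destination_embedding = [3.0, 5.0]

for name, combine in EDGE_METHODS.items():
    print(name, combine(source_embedding, destination_embedding))
```

With `Average`, the two example vectors combine to `[2.0, 4.0]`. Symmetric methods such as Average, Sum, Hadamard, and AbsoluteL1 give the same result regardless of edge direction, which suits a graph treated as undirected.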

We'll write these parameters out as a string, because their final destination is a YAML-format config file.

In [None]:
outstring = f"""
---
Target:
  target_path: {output_directory}

GraphDataConfiguration:
  graph:
    directed: {directed}
    node_path: {node_path}
    edge_path: {edge_path}
    verbose: True
    nodes_column: "id"
    node_list_node_types_column: "category"
    default_node_type: "biolink:NamedThing"
    sources_column: "subject"
    destinations_column: "object"
    default_edge_type: "biolink:related_to"
  evaluation_data:
    valid_data:
      pos_edge_filepath: {valid_edge_path}
      neg_edge_filepath: {neg_valid_edge_path}
    train_data:
      neg_edge_filepath: {neg_train_edge_path}

EmbeddingsConfig:
  filename: {embedding_file_name}
  history_filename: {embedding_history_file_name}
  node_embeddings_params:
    method_name: {node_embedding_method_name}
  tsne_filename: tsne.png

ClassifierContainer:
  classifiers:
    - classifier_id: lr_1
      classifier_name: {classifier_type}
      classifier_type: {classifier_model_type}
      edge_method: {edge_method}
      outfile: {classifier_model_outfile}
      parameters:
        sklearn_params:
          random_state: {classifier_model_random_state}
          max_iter: {classifier_model_max_iter}

ApplyTrainedModelsContainer:
  models:
    - model_id: lr_1
      node_types:
        source:
          - "biolink:NamedThing"
        destination:
          - "biolink:NamedThing"
      cutoff: 0.9
      outfile: lr_protein_predictions.tsv

"""
print(outstring)
with open(config_filename, "w") as outfile:
    outfile.write(outstring)
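Before running NEAT, it may be worth confirming that every input file named in the config actually exists, since a wrong path is an easy mistake to make here. This is a minimal sketch. `missing_inputs` is a hypothetical helper, and the paths listed are the default names from this tutorial, so substitute the notebook variables if yours differ.

```python
from pathlib import Path


def missing_inputs(paths):
    """Return the names of entries whose path is not an existing file."""
    return [name for name, path in paths.items() if not Path(path).is_file()]


# Default file names used earlier in this tutorial (assumed unchanged).
inputs = {
    "node_path": "merged-kg_nodes.tsv",
    "edge_path": "merged-kg_edges.tsv.train",
    "valid_edge_path": "merged-kg_edges.tsv.valid",
    "neg_train_edge_path": "merged-kg_edges.tsv.neg_train",
    "neg_valid_edge_path": "merged-kg_edges.tsv.neg_valid",
}

missing = missing_inputs(inputs)
print("Missing inputs:", missing if missing else "none")
```

Any name reported as missing points to a step above that was skipped or a filename that needs updating.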

In [None]:
# NOTE: not working properly at the moment! Throws a RuntimeError!
!neat run --config $config_filename

In [None]:
# Display the t-SNE plot of the embeddings (the tsne_filename set in the config).
from IPython.display import Image
Image(filename='tsne.png')