# A NEAT Demonstration

NEAT provides a way to define graph machine learning tasks with minimal coding, an uncomplicated interface, and a workflow designed with cloud computing in mind.

This notebook provides a demonstration of how to set up a NEAT configuration file. We define the values for parameters, write them to a YAML file, then pass that file to NEAT to generate graph embeddings.

You're likely reading this notebook from within the NEAT repository. If you haven't installed NEAT yet, please do so now using the next code block.

In [None]:
%cd ..
!pip install .
%cd notebooks/

## Define graph parameters

For demonstration purposes, we'll use a copy of the [ECTO ontology](https://obofoundry.org/ontology/ecto.html), pre-processed to graph form by [KG-OBO](https://github.com/Knowledge-Graph-Hub/kg-obo).

In [None]:
!wget https://kg-hub.berkeleybop.io/kg-obo/ecto/2022-03-09/ecto_kgx_tsv.tar.gz
!tar xvzf ecto_kgx_tsv.tar.gz

Now define the following graph parameters or just use the default values.

In [8]:
directed = False # Yes, this is technically a directed network, but we'll treat it as undirected
node_path = "ecto_kgx_tsv_nodes.tsv" # Our positive training nodes
edge_path = "ecto_kgx_tsv_edges.tsv" # Our positive training edges

We may want to have a positive validation set, too. Let's use the same edge file as above for now.

In [3]:
pos_valid_edge_path = edge_path
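
Reusing the training edges for validation keeps the demonstration short, but a validation score computed this way says little about generalization. As a minimal sketch of a holdout split, the following uses only the standard library. The helper `split_edge_rows`, the 80/20 ratio, and the seed are illustrative assumptions and are not part of NEAT or Ensmallen.

```python
import random

def split_edge_rows(rows, valid_fraction=0.2, seed=42):
    """Shuffle edge rows and split them into (train, valid) lists."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_valid = int(len(shuffled) * valid_fraction)
    return shuffled[n_valid:], shuffled[:n_valid]

# Toy rows standing in for lines of a KGX edge TSV (subject, predicate, object).
toy_edges = [(f"N:{i}", "biolink:related_to", f"N:{i + 1}") for i in range(10)]
train_rows, valid_rows = split_edge_rows(toy_edges)
print(len(train_rows), len(valid_rows))
```

Note that a purely random split like this can leave some nodes without any training edges, which matters for embedding quality on a real graph.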

Negative data would be nice to have, too. We'll define paths for the negative training and negative validation edge lists here, but won't create the files until the next section.

In [4]:
neg_train_edge_path = "negative_edges.tsv"
neg_valid_edge_path = "negative_valid_edges.tsv"

## Loading a graph and generating subgraphs

First task: load our training nodes and edges with Ensmallen and get some details about the graph.

In [5]:
from ensmallen import Graph

In [11]:
g = Graph.from_csv(
  directed=directed,
  node_path=node_path,
  edge_path=edge_path,
  verbose=True,
  nodes_column='id',
  node_list_node_types_column='category',
  default_node_type='biolink:NamedThing',
  sources_column='subject',
  destinations_column='object',
  edge_list_edge_types_column='predicate'
)
g

Now let's generate some negative edge sets. The following method creates a new Graph object with the specified number of edges, none of which appear in the original graph. Sampling doesn't add or remove nodes, so we drop the now-disconnected nodes for the sake of cleanliness.

In [18]:
negative_graph = g.sample_negatives(49622) # Just as many negative examples as positive examples
negative_graph = negative_graph.drop_disconnected_nodes()
negative_graph
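
To make the idea of a negative edge concrete, here is a small pure-Python sketch of negative sampling on a toy undirected graph. It is not how Ensmallen implements `sample_negatives`; the function `sample_negative_edges` and the toy data are hypothetical and only illustrate the concept.

```python
import random

def sample_negative_edges(nodes, positive_edges, n_samples, seed=42):
    """Draw distinct undirected node pairs that are absent from positive_edges."""
    rng = random.Random(seed)
    # Treat edges as unordered pairs and ignore self-loops.
    existing = {frozenset(e) for e in positive_edges if e[0] != e[1]}
    max_possible = len(nodes) * (len(nodes) - 1) // 2 - len(existing)
    if n_samples > max_possible:
        raise ValueError("not enough non-edges to sample from")
    negatives = set()
    while len(negatives) < n_samples:
        pair = frozenset(rng.sample(nodes, 2))
        if pair not in existing:
            negatives.add(pair)
    return [tuple(sorted(pair)) for pair in negatives]

toy_nodes = ["a", "b", "c", "d", "e"]
toy_positives = [("a", "b"), ("b", "c"), ("c", "d")]
toy_negatives = sample_negative_edges(toy_nodes, toy_positives, 3)
print(toy_negatives)
```

Rejection sampling like this is only practical when the graph is sparse, which is the usual case for knowledge graphs.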

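Now we'll write out those edge files. For this demonstration the same negative edges go into both the training and the validation file; a real experiment should sample the two sets separately. After that, we continue with defining the NEAT configuration.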

In [20]:
negative_graph.dump_edges(neg_train_edge_path, edges_type_column='predicate')
negative_graph.dump_edges(neg_valid_edge_path, edges_type_column='predicate')

## Define embedding parameters

These parameters are kept deliberately small for the purposes of the demonstration; the comments note more typical values.

In [21]:
embedding_file_name = "demo_embeddings.tsv"
embedding_history_file_name = "embedding_history.json"
node_embedding_method_name = "CBOW" # one of 'CBOW', 'GloVe', 'SkipGram', 'Siamese', 'TransE', 'SimplE', 'TransH', 'TransR'
walk_length = 10 # typically 100 or so
batch_size = 128 # typically 512? or more
window_size = 4
iterations = 5 # typically 20 or more
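
As a rough illustration of what `walk_length` and `window_size` control, the sketch below generates one uniform random walk on a toy graph and then lists the context nodes around each position, which is the kind of input a CBOW-style model trains on. The helpers `random_walk` and `context_windows` are hypothetical, and treating `window_size` as the number of nodes on each side is an assumption; the library may define the window differently.

```python
import random

def random_walk(adjacency, start, walk_length, rng):
    """Return a uniform random walk of at most walk_length nodes."""
    walk = [start]
    while len(walk) < walk_length:
        neighbours = adjacency[walk[-1]]
        if not neighbours:
            break  # dead end: stop early
        walk.append(rng.choice(neighbours))
    return walk

def context_windows(walk, window_size):
    """Yield (center, context) pairs with up to window_size nodes on each side."""
    for i, center in enumerate(walk):
        left = walk[max(0, i - window_size):i]
        right = walk[i + 1:i + 1 + window_size]
        yield center, left + right

toy_adjacency = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
toy_walk = random_walk(toy_adjacency, "a", walk_length=10, rng=random.Random(0))
toy_windows = list(context_windows(toy_walk, window_size=4))
print(toy_walk)
print(toy_windows[0])
```

Longer walks and wider windows give each node more context per training pass, at the cost of more computation.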

## Define classifier parameters

Here, we define a single classifier, but NEAT will accept a list of multiple classifier types.

We're going to set it up as if we were building a link prediction model. NEAT would prefer to have positive *and* negative training/validation data for this. Conveniently, we've provided paths to those above. 

In [22]:
edge_method = "Average" # one of EdgeTransformer.methods: Hadamard, Sum, Average, L1, AbsoluteL1, L2, or alternatively a lambda
classifier_type = "Logistic Regression"
classifier_model_outfile = "model_lr_demo"
classifier_model_type = "sklearn.linear_model.LogisticRegression"
classifier_model_random_state = 42
classifier_model_max_iter = 1000
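
The `edge_method` setting decides how the two node embeddings of an edge are combined into a single feature vector for the classifier. The sketch below shows four of the listed methods on plain Python lists. The function `edge_embedding` is hypothetical, and the definitions follow the usual conventions for these names rather than being taken from the library source.

```python
def edge_embedding(source, destination, method="Average"):
    """Combine two equal-length node embeddings into one edge embedding."""
    if len(source) != len(destination):
        raise ValueError("embeddings must have the same length")
    operations = {
        "Hadamard": lambda s, d: s * d,          # element-wise product
        "Sum": lambda s, d: s + d,               # element-wise sum
        "Average": lambda s, d: (s + d) / 2,     # element-wise mean
        "AbsoluteL1": lambda s, d: abs(s - d),   # element-wise absolute difference
    }
    if method not in operations:
        raise ValueError(f"unsupported method: {method}")
    op = operations[method]
    return [op(s, d) for s, d in zip(source, destination)]

u = [1.0, 2.0]
v = [3.0, 4.0]
for name in ("Hadamard", "Sum", "Average", "AbsoluteL1"):
    print(name, edge_embedding(u, v, name))
```

All four operations are symmetric in their arguments, which suits a graph that is being treated as undirected.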

## Define output parameters

We specify a local output path here, but NEAT can also upload to S3, given a bucket name and directory.

In [23]:
output_directory = "./"

config_filename = "demonstrate.yaml"

## Wrap it all up

In [24]:
outstring = f"""
graph_data:
  graph:
    directed: {directed}
    node_path: {node_path}
    edge_path: {edge_path}
    verbose: True
    nodes_column: 'id'
    node_list_node_types_column: 'category'
    default_node_type: 'biolink:NamedThing'
    sources_column: 'subject'
    destinations_column: 'object'
    default_edge_type: 'biolink:related_to'
  pos_validation:
    edge_path: {pos_valid_edge_path}
  neg_training:
    edge_path: {neg_train_edge_path}
  neg_validation:
    edge_path: {neg_valid_edge_path}

embeddings:
  embedding_file_name: {embedding_file_name}
  embedding_history_file_name: {embedding_history_file_name}
  node_embedding_params:
      node_embedding_method_name: {node_embedding_method_name}
      walk_length: {walk_length}
      batch_size: {batch_size}
      window_size: {window_size}
      return_weight: 1.0
      explore_weight: 1.0
      iterations: {iterations}
      use_mirrored_strategy: False

  tsne:
    tsne_file_name: tsne.png

classifier:
  edge_method: {edge_method}
  classifiers:
    - type: {classifier_type}
      model:
        outfile: {classifier_model_outfile}
        type: {classifier_model_type}
        parameters:
          random_state: {classifier_model_random_state}
          max_iter: {classifier_model_max_iter}

output_directory: {output_directory}
"""
print(outstring)


graph_data:
  graph:
    directed: False
    node_path: ecto_kgx_tsv_nodes.tsv
    edge_path: ecto_kgx_tsv_edges.tsv
    verbose: True
    nodes_column: 'id'
    node_list_node_types_column: 'category'
    default_node_type: 'biolink:NamedThing'
    sources_column: 'subject'
    destinations_column: 'object'
    default_edge_type: 'biolink:related_to'
  pos_validation:
    edge_path: ecto_kgx_tsv_edges.tsv
  neg_training:
    edge_path: negative_edges.tsv
  neg_validation:
    edge_path: negative_valid_edges.tsv

embeddings:
  embedding_file_name: demo_embeddings.tsv
  embedding_history_file_name: embedding_history.json
  node_embedding_params:
      node_embedding_method_name: CBOW
      walk_length: 10
      batch_size: 128
      window_size: 4
      return_weight: 1.0
      explore_weight: 1.0
      iterations: 5
      use_mirrored_strategy: False

  tsne:
    tsne_file_name: tsne.png

classifier:
  edge_method: Average
  classifiers:
    - type: Logistic Regression
      model:
  

In [25]:
with open(config_filename, "w") as outfile:
    outfile.write(outstring)
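
Before launching the run, it can help to confirm that every file named in the configuration actually exists in the working directory. The following is a minimal sketch of such a check; the helper `missing_paths` is hypothetical and is independent of whatever checks NEAT performs itself.

```python
from pathlib import Path

def missing_paths(paths):
    """Return the entries of paths that do not exist as regular files."""
    return [str(p) for p in paths if not Path(p).is_file()]

# File names as defined earlier in this notebook.
required_inputs = [
    "ecto_kgx_tsv_nodes.tsv",
    "ecto_kgx_tsv_edges.tsv",
    "negative_edges.tsv",
    "negative_valid_edges.tsv",
    "demonstrate.yaml",
]
absent = missing_paths(required_inputs)
if absent:
    print("Missing input files:", ", ".join(absent))
else:
    print("All input files are present.")
```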

In [26]:
!neat run --config $config_filename

Traceback (most recent call last):
  File "/home/harry/neat-env/bin/neat", line 8, in <module>
    sys.exit(cli())
  File "/home/harry/neat-env/lib/python3.8/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/home/harry/neat-env/lib/python3.8/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/home/harry/neat-env/lib/python3.8/site-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/harry/neat-env/lib/python3.8/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/harry/neat-env/lib/python3.8/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/home/harry/neat-env/lib/python3.8/site-packages/neat/cli.py", line 41, in run
    if not pre_run_checks(yhelp=yhelp):
  File "/home/harry/neat-env/lib/python3.8/site-packages/nea

In [None]:
from IPython.display import Image
Image(filename='tsne.png')