# KG-OntoML: Build Embeddings and a Link Prediction Classifier

Install the requirements.

In [None]:
%pip install -q grape -U
# silence_tensorflow is imported below and plot_keras_history is used in a later example
%pip install -q plot_keras_history seedir silence_tensorflow
# !pip install -q tsnecuda==3.0.0+cu110 -f https://tsnecuda.isx.ai/tsnecuda_stable.html --no-dependencies
# %pip install -q faiss

# Silence TensorFlow warnings, which are rarely useful here
import silence_tensorflow.auto

from ensmallen import Graph


Set up NEAT.

In [None]:
!pip install git+https://github.com/Knowledge-Graph-Hub/NEAT.git

Retrieve the KG-OntoML graph, decompress it, and check it.

In [None]:
!wget https://kg-hub.berkeleybop.io/kg-ontoml/20220304/KG-OntoML.tar.gz

In [None]:
!tar -xvzf KG-OntoML.tar.gz

In [None]:
!head merged-kg_nodes.tsv

In [None]:
!head merged-kg_edges.tsv
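
Beyond eyeballing the `head` output, the header rows can be checked programmatically. The following is a minimal sketch, assuming the files are tab-separated with a header row; the helper name `missing_columns` is made up for this example, and the expected column names are the ones the graph-loading cell in this notebook refers to.

```python
import csv
from pathlib import Path


def missing_columns(tsv_path, required):
    """Return the required column names that are absent from the TSV header row."""
    with open(tsv_path, newline="", encoding="utf-8") as handle:
        header = next(csv.reader(handle, delimiter="\t"), [])
    return [name for name in required if name not in header]


# Column names used by the graph-loading step in this notebook.
expected = {
    "merged-kg_nodes.tsv": ["id", "category"],
    "merged-kg_edges.tsv": ["subject", "object", "predicate"],
}

for path, columns in expected.items():
    if Path(path).exists():  # skip files that have not been unpacked yet
        print(path, "missing columns:", missing_columns(path, columns))
```

An empty list for each file means the header contains every column the loader will ask for.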

In [None]:
g = Graph.from_csv(
  directed=False,
  node_path='merged-kg_nodes.tsv',
  edge_path='merged-kg_edges.tsv',
  verbose=True,
  nodes_column='id',
  node_list_node_types_column='category',
  default_node_type='biolink:NamedThing',
  sources_column='subject',
  destinations_column='object',
  default_edge_type='biolink:related_to',
  edge_list_edge_types_column='predicate'
)
g

Now it's time to build the embeddings.

Even on a GPU, the following may take >16 min per epoch, so it's not ideal for demonstration purposes. A full SkipGram embedding on KG-OntoML with default parameters requires at least 12 epochs (>3 hours).

One option is to pass some extra arguments to `compute_node_embedding`:
* `use_mirrored_strategy=False` - disables multi-GPU support, but may help avoid some errors
* `iterations=1` - perform a single iteration only
* `walk_length=16` - the shortest reasonable random-walk length, at least for SkipGram
* `verbose=2` - applies to all embedding methods; a useful level of verbosity
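
As a rough sketch, these arguments can be collected in a dictionary and unpacked into the call. The name `quick_run_kwargs` is only illustrative, and whether every keyword is accepted depends on the installed Embiggen version, so treat the commented call as an assumption to verify rather than a guaranteed signature.

```python
# Illustrative name; the keys and values are the arguments listed above.
quick_run_kwargs = {
    "use_mirrored_strategy": False,  # no multi-GPU strategy
    "iterations": 1,                 # a single iteration only
    "walk_length": 16,               # short random walks
    "verbose": 2,                    # the verbosity level suggested above
}

# With the graph `g` loaded and Embiggen installed, the call would then be:
#     from embiggen.pipelines import compute_node_embedding
#     embedding, history = compute_node_embedding(
#         g, node_embedding_method_name="SkipGram", **quick_run_kwargs
#     )
print(quick_run_kwargs)
```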

This method also accepts a `fit_kwargs` argument, a dictionary that may contain any or all of the following key-value pairs:

* `"epochs": 2` - caps the number of training epochs; the default for some methods is essentially forever
* `"early_stopping_patience": 1` - stop after one epoch without sufficient improvement in the loss
* `"early_stopping_min_delta": 0.5` - the smallest decrease in loss that still counts as an improvement

See https://github.com/monarch-initiative/embiggen/blob/master/embiggen/embedders/embedder.py#L272 for other params.
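
To see how the two early-stopping keys interact, here is a simplified stand-in for Keras-style early stopping. It is not Embiggen's own code: the function name `epochs_before_stop` and the loss values are invented for illustration, and the logic assumes the loss is the monitored quantity and that lower is better.

```python
def epochs_before_stop(losses, patience, min_delta):
    """Return how many epochs run before a Keras-style early stop triggers."""
    best = float("inf")
    stalled = 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best - min_delta:
            best = loss   # the drop is large enough to count as an improvement
            stalled = 0
        else:
            stalled += 1  # an epoch without sufficient improvement
            if stalled >= patience:
                return epoch
    return len(losses)


# Made-up loss values, purely for illustration.
example_losses = [10.0, 9.0, 8.8, 8.7]
print(epochs_before_stop(example_losses, patience=1, min_delta=0.5))  # 3
print(epochs_before_stop(example_losses, patience=1, min_delta=0.0))  # 4
```

With `min_delta=0.5`, the drop from 9.0 to 8.8 is too small to count, so training stops after the third epoch. With `min_delta=0.0`, every epoch improves and all four run.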

Running this in Embiggen may look like this:
```
from embiggen.pipelines import compute_node_embedding
from plot_keras_history import plot_history

node_embedding_method_name = "SkipGram"

first_order_rw_node_embedding, training_history = compute_node_embedding(
    g,
    use_mirrored_strategy=False,
    node_embedding_method_name=node_embedding_method_name,
    verbose=2,
)

plot_history(
    training_history,
    title="First-order random walk based {} model applied to graph {}".format(
        node_embedding_method_name,
        g.get_name()
    )
)
```

For the sake of reproducibility, though, we'll set up a NEAT config instead.

In [None]:
outstring = """
graph_data:
  graph:
    directed: False
    node_path: merged-kg_nodes.tsv
    edge_path: merged-kg_edges.tsv
    verbose: True
    nodes_column: 'id'
    node_list_node_types_column: 'category'
    default_node_type: 'biolink:NamedThing'
    sources_column: 'subject'
    destinations_column: 'object'
    default_edge_type: 'biolink:related_to'
  pos_validation:
    edge_path: merged-kg_edges.tsv
  neg_training:
    edge_path: negative_edges.tsv
  neg_validation:
    edge_path: negative_valid_edges.tsv

embeddings:
  embedding_file_name: KG-OntoML-SkipGram
  embedding_history_file_name: KG-OntoML-SkipGram_history
  node_embedding_params:
    node_embedding_method_name: SkipGram
    walk_length: 100
    batch_size: 128
    window_size: 4
    return_weight: 1.0
    explore_weight: 1.0
    iterations: 20
    use_mirrored_strategy: False

  tsne:
    tsne_file_name: tsne.png

classifier:
  edge_method: Average
  classifiers:
    - type: Logistic Regression
      model:
        outfile: model_lr_kg-ontoml
        type: sklearn.linear_model.LogisticRegression
        parameters:
          random_state: 42
          max_iter: 1000

output_directory: "./"
"""
print(outstring)

In [None]:
with open("KG-OntoML-NEAT.yaml", "w") as outfile:
    outfile.write(outstring)
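
The config names two negative-edge files that none of the cells shown here create, so it can be worth confirming that every referenced input exists before starting a long run. This is a minimal sketch: the helper name `missing_input_files` is made up, and it uses a simple regex scan for `node_path` and `edge_path` keys rather than a real YAML parser.

```python
import re
from pathlib import Path


def missing_input_files(config_text, base_dir="."):
    """List files named by node_path/edge_path keys that do not exist under base_dir."""
    pattern = re.compile(
        r"^[ \t]*(?:node_path|edge_path):[ \t]*(\S+)[ \t]*$", re.MULTILINE
    )
    # Strip optional quotes and drop duplicates (the edge file appears twice).
    names = {match.strip("'\"") for match in pattern.findall(config_text)}
    return sorted(name for name in names if not (Path(base_dir) / name).exists())


config_file = Path("KG-OntoML-NEAT.yaml")
if config_file.exists():  # only check once the config has been written
    print("Missing inputs:", missing_input_files(config_file.read_text()))
```

Any file names printed here would need to be generated or downloaded before the NEAT run.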

In [None]:
!neat run --config KG-OntoML-NEAT.yaml

In [None]:
from IPython.display import Image
Image(filename='tsne.png') 