adds config description to help

JSybrandt · May 18, 2020 · b22e763
1 parent 74e88c9
commit b22e763
Showing 1 changed file with 99 additions and 2 deletions.
diff --git a/docs/help/embed_semantic_graph.md b/docs/help/embed_semantic_graph.md
@@ -80,8 +80,105 @@ will not be included in the output.
 Note that the argument passed with `--relations` should be a string with
 space-separated relationship types. Each relationship should be a two character
 long string. Relationships are also directed in PTBG, meaning that if you would
-like to select both UMLS -> predicate edges, as well as predicate -> UMLS edges,
-you will need to specify both edge types.
+like to select both `UMLS -> predicate` edges, as well as `predicate -> UMLS`
+edges, you will need to specify both edge types.
 
 **WARNING:** You will need to remember the order you list the relationships.
 This will determine the order of relationships in the PTBG config.
+
+## Create a PTBG Config
+
+Now that you have converted the agatha semantic graph for PTBG, you need to
+write a configuration script. Here are the [official docs for the PTBG
+config](https://torchbiggraph.readthedocs.io/en/latest/configuration_file.html).
+The following is an example PTBG config. The values you need to set are in the
+header section of the `get_torchbiggraph_config` function. Copy this example
+and change what you need.
+
+```python3
+#!/usr/bin/env python3
+def get_torchbiggraph_config():
+
+    # CHANGE THESE #########################################################
+
+    DATA_ROOT = "/path/to/data/root"
+    """ This is the location you specified with the `-o` flag when running
+    `convert_graph_for_pytorch_biggraph`. That tool should have created
+    `DATA_ROOT/entities` and `DATA_ROOT/edges`. This process will create
+    `DATA_ROOT/embeddings`.  """
+
+    PARTS = 100
+    """ This is the number of partitions that all nodes and edges have been
+    split between when running `convert_graph_for_pytorch_biggraph`. By default,
+    we create 100 partitions. If you specified `--partition-count` (`-c`), then
+    you need to change this value to reflect the new partition count.  """
+
+    ENT_TYPES = "selmnp"
+    """ This is the set of entities specified when running
+    `convert_graph_for_pytorch_biggraph`. The above value is the default. If you
+    used the `--types` flag, then you need to set this value accordingly."""
+
+    RELATIONS = [ "ss", "se", "es", "sl", "ls", "sm", "ms", "sn", "ns", "sp",
+                  "ps", "pn", "np", "pm", "mp", "pl", "lp", "pe", "ep" ]
+    """ This is the ordered list of relationships that you specified when
+    running `convert_graph_for_pytorch_biggraph`. The above is the default. If
+    you specified `--relations` then you need to set this value accordingly.
+    WARNING: The order of relationships matters! This list should be in the same
+    order as the relationships specified in the `--relations` argument.
+    """
+
+    EMBEDDING_DIM = 512
+    """ This is the number of floats in the embedding vector computed for each
+    node. """
+
+    NUM_COMPUTE_NODES = 20
+    """ This is the number of computers used to compute the embedding. We find
+    that around 20 machines is the sweet spot. Using more or fewer machines
+    slows down the embedding computation. """
+
+    THREADS_PER_NODE = 24
+    """ This is the number of threads that each machine will use to compute
+    embeddings. """
+
+    #########################################################################
+
+    config = dict(
+        # IO Paths
+        entity_path=DATA_ROOT+"/entities",
+        edge_paths=[DATA_ROOT+"/edges"],
+        checkpoint_path=DATA_ROOT+"/embeddings",
+
+        # Graph structure
+        entities={t: {'num_partitions': PARTS} for t in ENT_TYPES},
+        relations=[
+          dict(name=rel, lhs=rel[0], rhs=rel[1], operator='translation')
+          for rel in RELATIONS
+        ],
+
+        # Scoring model
+        dimension=EMBEDDING_DIM,
+        comparator='dot',
+        bias=True,
+
+        # Training
+        num_epochs=5,
+        num_uniform_negs=50,
+        loss_fn='softmax',
+        lr=0.02,
+
+        # Evaluation during training
+        eval_fraction=0,
+
+        # One per allowed thread
+        workers=THREADS_PER_NODE,
+        num_machines=NUM_COMPUTE_NODES,
+        distributed_init_method="env://",
+        num_partition_servers=-1,
+    )
+
+    return config
+```
+
+## Launch the PTBG training cluster
+
+Now you are ready to start training!
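
Review note: the config in this diff relies on `RELATIONS` matching the `--relations` argument in both content and order, and on every relation being a two-character string whose characters are entity types in `ENT_TYPES` (the config reads them as `rel[0]` and `rel[1]`). A minimal sketch of a sanity check for those header values might look like the following. The helper names `relations_from_flag` and `check_header` are hypothetical and are not part of agatha or PTBG.

```python
def relations_from_flag(flag_value):
    """Split a space-separated `--relations` string, preserving its order."""
    return flag_value.split()


def check_header(ent_types, relations):
    """Return a list of problems found; an empty list means none were found.

    Hypothetical helper: it only checks the conventions described in the
    help text (two-character relations built from known entity types).
    """
    problems = []
    seen = set()
    for rel in relations:
        if len(rel) != 2:
            problems.append(f"{rel!r} is not two characters long")
            continue
        # The config uses rel[0] as the lhs type and rel[1] as the rhs type.
        for side, ent in (("lhs", rel[0]), ("rhs", rel[1])):
            if ent not in ent_types:
                problems.append(f"{rel!r} {side} type {ent!r} is not in ENT_TYPES")
        if rel in seen:
            problems.append(f"{rel!r} is listed more than once")
        seen.add(rel)
    return problems


if __name__ == "__main__":
    # Reuse the exact string that was passed to `--relations`, so the order
    # in the config cannot drift from the order used during conversion.
    relations = relations_from_flag("ss se es sl ls")
    print(relations)
    print(check_header("selmnp", relations))
```

Building `RELATIONS` from the same string that was passed to `--relations` avoids retyping the list by hand, which is where ordering mistakes are most likely to creep in.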