doc: connection; bump version
also add an installation script for Bridge server
Contextualist committed Jan 17, 2022
1 parent 3d91625 commit b2a80f8
Showing 5 changed files with 136 additions and 3 deletions.
32 changes: 32 additions & 0 deletions bridge/install.sh
@@ -0,0 +1,32 @@
#!/usr/bin/env bash
set -e
PREF=~/.local/bin
mkdir -p $PREF
rm -f $PREF/grain-bridge
echo "Downloading the executable..."
curl -fsSL https://github.com/Contextualist/grain/releases/latest/download/grain-bridge-linux-amd64 -o $PREF/grain-bridge
chmod +x $PREF/grain-bridge

cat > $PREF/grain-bridge-daemon <<EOF
#!/usr/bin/env bash
set -e
if ! screen -list | grep -q 'grain-bridge'; then
    screen -dmS grain-bridge
fi
if ! pgrep -x 'grain-bridge' > /dev/null; then
    screen -r grain-bridge -X stuff \$'grain-bridge\n'
fi
echo "The Bridge server is running in Screen session 'grain-bridge'."
echo "Dial/listen with this Bridge server by using one of the following URLs:"
IPs=\$(hostname -I)
for IP in \$IPs; do
    h=\$(host \$IP | rev | cut -d' ' -f1 | rev)
    h=\${h%?}
    echo -e "\tbridge://YOUR-KEY-NAME@\$h:9555"
    echo -e "\tbridge://YOUR-KEY-NAME@\$IP:9555"
done
EOF
chmod +x $PREF/grain-bridge-daemon

echo "Bridge server 'grain-bridge' has been installed at $PREF; please make sure that it is on your \$PATH."
echo "Use wrapper script 'grain-bridge-daemon' to start the server in background."
93 changes: 93 additions & 0 deletions docs/source/connection.rst
@@ -0,0 +1,93 @@
Connection discovery and Communication
======================================

The network connects the nodes, or individual computers, in a supercomputing
cluster. To talk over the network, head and worker processes need to find each
other through a connection protocol: either some form of an "address" or a
discovery service. After a connection is established, they talk using a
communication protocol, a standardized language for efficient exchange of
information. The connection protocols used by the head and workers are
controlled by ``head.listen`` and ``worker.dial`` in the Grain config. Most of
this document consists of specifications and technical details, but the
introductory part of each section is generally useful information.

Available connection protocols:

- `TCP <https://en.wikipedia.org/wiki/Transmission_Control_Protocol>`__/IP
  address (``tcp://``), the vanilla network protocol
- `Unix domain socket <https://en.wikipedia.org/wiki/Unix_domain_socket>`__
  (``unix://``), like TCP, but for same-host connections
- Bridge (``bridge://``), connection discovery through a coordinator server
- Edge (``edge://``), connection discovery through a network filesystem

The Bridge and Edge protocols are connection discovery methods private to Grain.
Connection discovery is useful in a supercomputing cluster because the allocated
machines differ from time to time, and we need a fixed information source
to keep track of these transient addresses. Connections established by these
methods are ultimately TCP sockets. The rest of this document assumes a basic
knowledge of TCP sockets.


Bridge protocol
---------------

The Bridge protocol enables connection discovery through a Bridge server, a
third-party, always-on service with a fixed public address. Before establishing
a connection, the dialers and the listener contact the Bridge server to arrange
a rendezvous: each gets the other's address from the Bridge server, then tries
to establish a connection through TCP hole punching. Since the Bridge server
notifies both parties once both of them are ready, a dialer can "initiate" the
connection before the listener starts listening, allowing a more flexible
connection process than a plain TCP connection.
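The pairing step described above can be illustrated with a toy in-memory model. This is only a sketch of the idea and not Grain's implementation: the class ``RendezvousTable`` and its methods are hypothetical names, there is no networking, and TCP hole punching is left out entirely. It shows only why a dialer may register before the listener exists.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

Addr = Tuple[str, int]  # (host, port)


class RendezvousTable:
    """Toy model of pairing dialers with a listener by key (hypothetical)."""

    def __init__(self) -> None:
        self._listener: Dict[str, Addr] = {}
        self._waiting: Dict[str, List[Addr]] = defaultdict(list)

    def dial(self, key: str, addr: Addr) -> Optional[Tuple[Addr, Addr]]:
        """Register a dialer; return (dialer, listener) if the listener is known."""
        listener = self._listener.get(key)
        if listener is None:
            # No listener yet: park the dialer until one shows up.
            self._waiting[key].append(addr)
            return None
        return (addr, listener)

    def listen(self, key: str, addr: Addr) -> List[Tuple[Addr, Addr]]:
        """Register the listener; return a pair for every dialer that was waiting."""
        self._listener[key] = addr
        return [(dialer, addr) for dialer in self._waiting.pop(key, [])]
```

In a real server each returned pair would trigger a notification to both parties, which would then attempt the hole punch themselves.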

.. code:: none

   bridge://{key}@{bridge_addr}:{bridge_port}[?iface={interface}]

where ``key`` is a pre-shared string for dialers and the listener to identify
each other; only those with the same key are allowed to connect to each
other. ``bridge_addr`` and ``bridge_port`` are the TCP/IP address of the Bridge
server. ``interface`` optionally specifies the network interface for the local
address.
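To make the URL fields concrete, here is a minimal sketch that splits a Bridge URL using Python's standard ``urllib.parse``. The names ``BridgeURL`` and ``parse_bridge_url`` are made up for illustration and are not part of Grain's API; Grain's own parsing may differ.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import parse_qs, urlsplit


@dataclass
class BridgeURL:
    key: str
    bridge_addr: str
    bridge_port: int
    iface: Optional[str] = None


def parse_bridge_url(url: str) -> BridgeURL:
    """Split a bridge:// URL into the fields named in the format above."""
    parts = urlsplit(url)
    if parts.scheme != "bridge":
        raise ValueError(f"not a bridge URL: {url!r}")
    if not parts.username or not parts.hostname or parts.port is None:
        raise ValueError(f"expected bridge://key@addr:port, got {url!r}")
    # `iface` is the only query parameter the format mentions.
    iface = parse_qs(parts.query).get("iface", [None])[0]
    return BridgeURL(parts.username, parts.hostname, parts.port, iface)
```

For example, ``parse_bridge_url("bridge://my-key@10.0.0.5:9555?iface=ib0")`` yields key ``my-key``, address ``10.0.0.5``, port ``9555`` and interface ``ib0``; the key, address and interface name here are placeholders.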

Setup a bridge server
~~~~~~~~~~~~~~~~~~~~~

On a host whose network is accessible to your head and workers' hosts, run the
following install script:

.. code:: bash

   curl -sSfL https://github.com/Contextualist/grain/raw/master/bridge/install.sh | bash

The installation comes with setup and usage instructions.

Caveat
~~~~~~

Even though TCP hole punching is able to establish a connection in most
network environments, it might not work for certain types of NAT, or in
situations where firewalls prohibit connections between the network interfaces
of the two parties in both directions. Sometimes an alternative network
interface might work if the default one fails.
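When trying an alternative interface, the ``iface`` query parameter of the Bridge URL is what changes. The helper below is a hypothetical convenience (``with_iface`` is not a Grain function) that rewrites a URL with the standard library; the interface names are placeholders for whatever your hosts actually expose.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_iface(url: str, iface: str) -> str:
    """Return `url` with its `iface` query parameter set to `iface`."""
    parts = urlsplit(url)
    # Drop any existing iface, keep other parameters untouched.
    query = [(k, v) for k, v in parse_qsl(parts.query) if k != "iface"]
    query.append(("iface", iface))
    return urlunsplit(parts._replace(query=urlencode(query)))
```

The resulting string can then be used as the ``head.listen`` or ``worker.dial`` value in place of the original URL.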

Rendezvous specification
~~~~~~~~~~~~~~~~~~~~~~~~

TODO


Edge protocol
-------------

TODO

..
   caveat: well-behaved NFS; firewall;

..
   edge file specification

Msgpack schema for head-worker communication
--------------------------------------------
4 changes: 2 additions & 2 deletions docs/source/index.rst
@@ -22,13 +22,13 @@ commit messages provide better information for the features and fixes)
tutorial_delayed.rst
api_delayed.rst
resource.rst
connection.rst
util.rst

Work in progress:

* Resource: a language for coordination
* Bridge protocol: universal connection
* Context module: plugin system for worker
* Advanced usage
* FAQ
* Low-level API reference

8 changes: 8 additions & 0 deletions docs/source/tutorial_delayed.rst
@@ -334,6 +334,14 @@ our jobs finish, all workers quit, too. You can repeat this with different resou
to the jobs, add delays in the jobs using ``trio.sleep``, and see if you can make
the jobs run on different computation nodes.

.. note::

   You might notice that the workers do not leave immediately after all the computation is
   done. That is because the scheduler is still running in the background, so that if you
   start another calculation mission shortly afterwards, the workers can be reused. You can
   also run multiple missions concurrently, sharing a swarm of workers. Missions (i.e. the
   head processes) running on the same machine with the same ``head.listen`` config will
   reuse the scheduler.

What's next?
------------
2 changes: 1 addition & 1 deletion grain/_version.py
@@ -1 +1 @@
-__version__ = "0.15.2+dev"
+__version__ = "0.16.0"
