
Commit

Update doc index.html
bichengying committed Nov 6, 2020
1 parent 30539b0 commit a937a1d
Showing 2 changed files with 35 additions and 109 deletions.
6 changes: 4 additions & 2 deletions README.rst
@@ -63,7 +63,7 @@ Quick Start
First, make sure your environment has ``python>=3.7`` and ``openmpi >= 4.0``.
Then, install Bluefog with: ``pip install --no-cache-dir bluefog`` or
``BLUEFOG_WITH_NCCL=1 pip install bluefog`` if NCCL is supported (``NCCL>=2.7``). Check
the ``install_bluefog`` page if you need more information or other install options.
the `install_bluefog <https://bluefog-lib.github.io/bluefog/install.html>`_ page if you need more information or other install options.

We provide a high-level wrapper for the torch optimizer. The only change you need to
make an existing script distributed is wrapping the optimizer
@@ -110,7 +110,8 @@ more code:
optimizer.step()
Check our BlueFog `dynamic topology neighbor averaging <https://bluefog-lib.github.io/bluefog/neighbor_average.html>`_
page to see more on how to control and use topology.
page to see more on how to control and use topology. See the BlueFog `examples`_ folder for the full code.


We also provide many low-level functions, which you can use as building
blocks to construct your own distributed training algorithm. The following example
@@ -136,3 +137,4 @@ Citation
*BlueFog: Make Decentralized Algorithms Practical for Optimization and Deep Learning*, Bluefog Team, to appear in 2020

.. _AWS: https://aws.amazon.com/about-aws/whats-new/2018/12/introducing-amazon-ec2-p3dn-instances-our-most-powerful-gpu-instance-yet/
.. _examples: https://github.com/Bluefog-Lib/bluefog/tree/master/examples
138 changes: 31 additions & 107 deletions docs/index.rst
@@ -11,112 +11,34 @@ Bluefog
   :align: center
   :alt: Bluefog Logo

Bluefog is a distributed training framework for PyTorch based
on diffusion/consensus-type algorithms.
The goal of Bluefog is to make distributed and decentralized machine learning fast,
fault-tolerant, friendly to heterogeneous environments, and easy to use.

The most distinguishing feature of Bluefog, compared with other popular distributed training frameworks such as
DistributedDataParallel provided by PyTorch, Horovod, and BytePS, is that our core implementation is rooted in the idea
of introducing a virtual topology over the processes: local averaging, defined by the connections in the virtual topology,
converges to the global average as the iterations proceed.
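
As a quick, framework-free illustration of this claim (a minimal numpy sketch, not the Bluefog API;
the ring topology and the 1/3 mixing weights below are chosen only for illustration), repeated local
averaging with a doubly-stochastic mixing matrix drives every local value to the global average:

.. code-block:: python

    import numpy as np

    n = 8
    # Doubly-stochastic mixing matrix for a ring: each process averages
    # itself with its two direct neighbors.
    W = np.zeros((n, n))
    for i in range(n):
        W[i, i] = W[i, (i - 1) % n] = W[i, (i + 1) % n] = 1.0 / 3
    x = np.arange(n, dtype=float)   # each process starts with its own value
    for _ in range(200):
        x = W @ x                   # one round of local neighbor averaging
    print(x)                        # every entry is close to the global average 3.5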

Leveraging the *One-sided Communication Ops* (i.e. remote-memory access) of MPI, Bluefog is not only a distributed
but also a decentralized training framework with high performance.

* Unlike the *ring-allreduce* based Bulk Synchronous Parallel algorithm, each process in bluefog is highly decoupled, so we can maximize the power of asynchronous algorithms.
* Unlike the *parameter-server* (PS) based architecture, there is no central node to collect and distribute information, so we avoid the bottleneck inherent in the PS.

The cost of these advantages is the inconsistency between local models. Please check our papers for the theoretical guarantees.

.. Important::

   Although most torch-based APIs perform well, this repository is still
   in an early stage of development and more features remain to be implemented.
   If you are interested, you are more than welcome to join us and contribute to this project!

Quick Start
-----------

First, make sure your environment has ``python>=3.7`` and `openmpi`_ >= 4.0.
Then, install Bluefog with: ``pip install --no-cache-dir bluefog``. Check
the :ref:`install_bluefog` page if you need more information or other install options.

We provide a high-level wrapper for the torch optimizer.
Often, the only change you need to make an existing script distributed is wrapping the optimizer
with our ``DistributedNeighborAllreduceOptimizer``
and then running it through ``bfrun``. That is it!

.. code-block:: python

    # Execute Python functions in parallel through
    # bfrun -np 4 python file.py
    import torch
    import bluefog.torch as bf
    ...
    bf.init()
    optimizer = optim.SGD(model.parameters(), lr=lr * bf.size())
    optimizer = bf.DistributedNeighborAllreduceOptimizer(
        optimizer, model
    )
    ...

We also provide many low-level functions, which you can use as building
blocks to construct your own distributed training algorithm. The following example
illustrates how to run a simple consensus algorithm through bluefog.

.. code-block:: python

    import torch
    import bluefog.torch as bf

    bf.init()
    x = torch.Tensor([bf.rank()])
    for _ in range(100):
        x = bf.neighbor_allreduce(x)
    print(f"{bf.rank()}: Average value of all ranks is {x}")

One notable feature of Bluefog is that we leverage the one-sided communication of MPI
to build truly decentralized and asynchronous algorithms. The following example shows how to use
Bluefog to implement an asynchronous push-sum consensus algorithm, in which each agent maintains
a value together with a weight and the ratio of the two converges to the global average.

.. code-block:: python

    import torch
    import bluefog.torch as bf
    from bluefog.common import topology_util

    bf.init()
    # Setup the topology for communication
    bf.set_topology(topology_util.PowerGraph(bf.size()))
    outdegree = len(bf.out_neighbor_ranks())
    indegree = len(bf.in_neighbor_ranks())
    # Create the buffer for neighbors.
    x = torch.Tensor([bf.rank(), 1.0])
    bf.win_create(x, name="x_buff", zero_init=True)
    for _ in range(100):
        bf.win_accumulate(
            x, name="x_buff",
            dst_weights={rank: 1.0 / (outdegree + 1)
                         for rank in bf.out_neighbor_ranks()},
            require_mutex=True)
        x.div_(1 + outdegree)
        bf.win_update_then_collect(name="x_buff")
    bf.barrier()
    # Do not forget to sync at last!
    bf.win_update_then_collect(name="x_buff")
    print(f"{bf.rank()}: Average value of all ranks is {x[0]/x[-1]}")

Please explore our `examples`_ folder to see more about
how to implement deep learning training and distributed
optimization algorithms quickly and easily through bluefog. If you want to understand more about
how to use the low-level API as building blocks for your own distributed
algorithm, please read our :ref:`Ops Explanation` page.
OVERVIEW
--------

BlueFog is a high-performance distributed training framework for PyTorch built on **decentralized optimization** algorithms.
The goal of Bluefog is to make decentralized algorithms easy to use, fault-tolerant, friendly to heterogeneous environments,
and even faster than training frameworks built on parameter-server or ring-allreduce architectures.

In each communication stage, BlueFog uses neither the typical star-shaped parameter-server topology nor the pipelined ring-allreduce topology,
which is fundamentally different from other popular distributed training frameworks such as DistributedDataParallel provided by PyTorch, Horovod, and BytePS.
Instead, BlueFog exploits a virtual, and possibly dynamic, network topology (which can be of any shape) to achieve the best communication efficiency.
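
For concreteness, below is a minimal sketch of installing and inspecting a virtual topology. It relies only on the
``bf.set_topology``, ``topology_util.PowerGraph``, ``bf.in_neighbor_ranks``, and ``bf.out_neighbor_ranks`` calls that
appear in the examples of this repository; the choice of ``PowerGraph`` here is just one possibility.

.. code-block:: python

    import bluefog.torch as bf
    from bluefog.common import topology_util

    bf.init()
    # Install a virtual topology; other graphs from topology_util could be used instead.
    bf.set_topology(topology_util.PowerGraph(bf.size()))
    print(f"rank {bf.rank()} receives from {bf.in_neighbor_ranks()} "
          f"and sends to {bf.out_neighbor_ranks()}")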

..
   Main Idea: Replace expensive allreduce averaging over gradients by cheap neighbor averaging over parameters

In each training iteration, every process (or agent) updates its model with information received from its **direct** neighbors as defined by the virtual topology. Note that all communication occurs only over the predefined virtual topology and no global communication is required. This is why such algorithms are called *decentralized*.
Decentralized training algorithms have been proved in the literature to converge to the same solution as their standard centralized counterparts.
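
For example, the following minimal sketch performs one hundred decentralized iterations using the ``bf.neighbor_allreduce``
primitive provided by this repository; the toy quadratic objective, in which each agent pulls toward its own rank, is only
an illustration.

.. code-block:: python

    import torch
    import bluefog.torch as bf

    bf.init()
    w = torch.zeros(1)                   # local copy of the model parameter
    target = torch.Tensor([bf.rank()])   # toy local objective: (w - rank)^2
    for _ in range(100):
        grad = 2 * (w - target)          # gradient of the local loss
        w = w - 0.1 * grad               # local gradient step
        w = bf.neighbor_allreduce(w)     # average parameters with direct neighbors only
    print(f"rank {bf.rank()}: w = {w}")  # all agents end up close to a common solution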

The topology determines the communication efficiency. BlueFog supports both **static** and **dynamic** topologies. After extensive trials, the dynamic Exponential-2 graph is observed to achieve the best performance
when the number of agents is a power of 2, such as 4, 32, or 128 agents. In the Exponential-2 graph, each agent
communicates with the neighbors that are 2 :sup:`0`, 2 :sup:`1`, ..., 2 :sup:`t` hops away. A **dynamic** topology means each agent selects
only one neighbor in one iteration and the next neighbor in the next iteration, as illustrated in the following figure:

.. raw:: html

   <p align="center"><img src="https://user-images.githubusercontent.com/16711681/97928035-04654400-1d1b-11eb-91d2-2da890b4522e.png" alt="one-peer-exp2" width="650"/></p>

In this scenario, the communication cost for each iteration is only one unit of delay, one standard parameter size to transmit, and no communication conflicts occur, which is better than what parameter-server or ring-allreduce promises. As for the loss and accuracy guarantees, please check out our theoretical paper. [Will add a full tutorial soon].
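
As a framework-free illustration of this schedule (a sketch only, not a BlueFog API), the following function returns the
single neighbor an agent contacts at a given iteration under the one-peer Exponential-2 rule described above:

.. code-block:: python

    def one_peer_exp2_neighbor(rank: int, size: int, iteration: int) -> int:
        """Neighbor that `rank` contacts at `iteration`, assuming `size` is a power of 2."""
        num_rounds = size.bit_length() - 1       # log2(size) for a power-of-2 size
        hop = 2 ** (iteration % num_rounds)      # 1, 2, 4, ... hops, cycling
        return (rank + hop) % size

    # With 8 agents, rank 3 cycles through neighbors 4, 5, 7, 4, 5, 7, ...
    for t in range(6):
        print(t, one_peer_exp2_neighbor(rank=3, size=8, iteration=t))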

.. toctree::
   :maxdepth: 1
@@ -148,4 +70,6 @@ algorithm, please read our :ref:`Ops Explanation` page.
   Development Guide <devel_guide>

.. _openmpi: https://www.open-mpi.org/
.. _examples: https://github.com/ybc1991/bluefog/tree/master/examples
.. _examples: https://github.com/Bluefog-Lib/bluefog/tree/master/examples
.. _AWS: https://aws.amazon.com/about-aws/whats-new/2018/12/introducing-amazon-ec2-p3dn-instances-our-most-powerful-gpu-instance-yet/
