THUDM · cenyk1230 · Oct 13, 2021 · Aug 7, 2021 · Aug 7, 2021 · Aug 16, 2021
diff --git a/.readthedocs.yml b/.readthedocs.yml
@@ -10,6 +10,6 @@ formats: all
 
 # Optionally set the version of Python and requirements required to build your docs
 python:
-  version: 3.6
+  version: 3.7
   install:
     - requirements: docs/requirements.txt
diff --git a/.travis.yml b/.travis.yml
@@ -1,14 +1,14 @@
 language: python
 
 python:
-  - "3.6"
+  - "3.7"
 
 install:
-  - pip install https://download.pytorch.org/whl/cpu/torch-1.7.1%2Bcpu-cp36-cp36m-linux_x86_64.whl
-  - pip install https://pytorch-geometric.com/whl/torch-1.7.0+cpu/torch_scatter-2.0.7-cp36-cp36m-linux_x86_64.whl
-  - pip install https://pytorch-geometric.com/whl/torch-1.7.0+cpu/torch_sparse-0.6.9-cp36-cp36m-linux_x86_64.whl
-  - pip install https://pytorch-geometric.com/whl/torch-1.7.0+cpu/torch_cluster-1.5.9-cp36-cp36m-linux_x86_64.whl
-  - pip install https://pytorch-geometric.com/whl/torch-1.7.0+cpu/torch_spline_conv-1.2.1-cp36-cp36m-linux_x86_64.whl
+  - pip install https://download.pytorch.org/whl/cpu/torch-1.7.1%2Bcpu-cp37-cp37m-linux_x86_64.whl
+  - pip install https://pytorch-geometric.com/whl/torch-1.7.0+cpu/torch_scatter-2.0.7-cp37-cp37m-linux_x86_64.whl
+  - pip install https://pytorch-geometric.com/whl/torch-1.7.0+cpu/torch_sparse-0.6.9-cp37-cp37m-linux_x86_64.whl
+  - pip install https://pytorch-geometric.com/whl/torch-1.7.0+cpu/torch_cluster-1.5.9-cp37-cp37m-linux_x86_64.whl
+  - pip install https://pytorch-geometric.com/whl/torch-1.7.0+cpu/torch_spline_conv-1.2.1-cp37-cp37m-linux_x86_64.whl
   - pip install torch-geometric
   - pip install dgl==0.4.3
   - pip install packaging==20.9

diff --git a/MANIFEST.in b/MANIFEST.in
@@ -1,2 +1 @@
-include cogdl/match.yml
 include cogdl/operators/*
diff --git a/README.md b/README.md
@@ -21,18 +21,20 @@ We summarize the contributions of CogDL as follows:
 
 ## ❗ News
 
+- The new **v0.5.0b1 pre-release** designs and implements a unified training loop for GNNs. It introduces `DataWrapper` to help prepare the training/validation/test data and `ModelWrapper` to define the training/validation/test steps.
+
 - The new **v0.4.1 release** adds the implementation of Deep GNNs and the recommendation task. It also supports new pipelines for generating embeddings and recommendation. You are welcome to join our tutorial at KDD 2021 from 10:30 am to 12:00 pm, Aug. 14th (Singapore Time). More details can be found at https://kdd2021graph.github.io/. 🎉
 
 - The new **v0.4.0 release** refactors the data storage (from `Data` to `Graph`) and provides more fast operators to speed up GNN training. It also includes many self-supervised learning methods on graphs. BTW, we are glad to announce that we will give a tutorial on KDD 2021 in August. Please see [this link](https://kdd2021graph.github.io/) for more details. 🎉
 
-- CogDL supports GNN models with Mixture of Experts (MoE). You can install [FastMoE](https://github.com/laekov/fastmoe) and try **[MoE GCN](./cogdl/models/nn/moe_gcn.py)** in CogDL now!
-
 <details>
 <summary>
 News History
 </summary>
 <br/>
 
+- CogDL supports GNN models with Mixture of Experts (MoE). You can install [FastMoE](https://github.com/laekov/fastmoe) and try **[MoE GCN](./cogdl/models/nn/moe_gcn.py)** in CogDL now!
+
 - The new **v0.3.0 release** provides a fast spmm operator to speed up GNN training. We also release the first version of **[CogDL paper](https://arxiv.org/abs/2103.00959)** in arXiv. You can join [our slack](https://join.slack.com/t/cogdl/shared_invite/zt-b9b4a49j-2aMB035qZKxvjV4vqf0hEg) for discussion. 🎉🎉🎉
 
 - The new **v0.2.0 release** includes easy-to-use `experiment` and `pipeline` APIs for all experiments and applications. The `experiment` API supports automl features of searching hyper-parameters. This release also provides `OAGBert` API for model inference (`OAGBert` is trained on large-scale academic corpus by our lab). Some features and models are added by the open source community (thanks to all the contributors 🎉).
@@ -47,7 +49,7 @@ News History
 
 ### Requirements and Installation
 
-- Python version >= 3.6
+- Python version >= 3.7
 - PyTorch version >= 1.7.1
 
 Please follow the instructions here to install PyTorch (https://github.com/pytorch/pytorch#installation).
@@ -83,34 +85,30 @@ A quickstart example can be found in the [quick_start.py](https://github.com/THU
 from cogdl import experiment
 
 # basic usage
-experiment(task="node_classification", dataset="cora", model="gcn")
+experiment(dataset="cora", model="gcn")
 
 # set other hyper-parameters
-experiment(task="node_classification", dataset="cora", model="gcn", hidden_size=32, max_epoch=200)
+experiment(dataset="cora", model="gcn", hidden_size=32, max_epoch=200)
 
 # run over multiple models on different seeds
-experiment(task="node_classification", dataset="cora", model=["gcn", "gat"], seed=[1, 2])
+experiment(dataset="cora", model=["gcn", "gat"], seed=[1, 2])
 
 # automl usage
-def func_search(trial):
+def search_space(trial):
     return {
         "lr": trial.suggest_categorical("lr", [1e-3, 5e-3, 1e-2]),
         "hidden_size": trial.suggest_categorical("hidden_size", [32, 64, 128]),
         "dropout": trial.suggest_uniform("dropout", 0.5, 0.8),
     }
 
-experiment(task="node_classification", dataset="cora", model="gcn", seed=[1, 2], func_search=func_search)
+experiment(dataset="cora", model="gcn", seed=[1, 2], search_space=search_space)
 ```
 
 Some interesting applications can be used through `pipeline` API. An example can be found in the [pipeline.py](https://github.com/THUDM/cogdl/tree/master/examples/pipeline.py). 
 
 ```python
 from cogdl import pipeline
 
-# print the statistics of datasets
-stats = pipeline("dataset-stats")
-stats(["cora", "citeseer"])
-
 # load OAGBert model and perform inference
 oagbert = pipeline("oagbert")
 outputs = oagbert(["CogDL is developed by KEG, Tsinghua.", "OAGBert is developed by KEG, Tsinghua."])
@@ -120,26 +118,25 @@ More details of the OAGBert usage can be found [here](./cogdl/oag/README.md).
 
 ### Command-Line Usage
 
-You can also use `python scripts/train.py --task example_task --dataset example_dataset --model example_model` to run example_model on example_data and evaluate it via example_task.
+You can also use `python scripts/train.py --dataset example_dataset --model example_model` to run `example_model` on `example_dataset`.
 
-- --task, downstream tasks to evaluate representation like `node_classification`, `unsupervised_node_classification`, `graph_classification`. More tasks can be found in the [cogdl/tasks](https://github.com/THUDM/cogdl/tree/master/cogdl/tasks).
-- --dataset, dataset name to run, can be a list of datasets with space like `cora citeseer ppi`. Supported datasets include
+- --dataset, dataset name to run; can be a space-separated list of datasets such as `cora citeseer`. Supported datasets include
 'cora', 'citeseer', 'pubmed', 'ppi', 'wikipedia', 'blogcatalog', 'flickr'. More datasets can be found in [cogdl/datasets](https://github.com/THUDM/cogdl/tree/master/cogdl/datasets).
-- --model, model name to run, can be a list of models like `deepwalk line prone`. Supported models include
+- --model, model name to run, can be a list of models like `gcn gat`. Supported models include
 'gcn', 'gat', 'graphsage', 'deepwalk', 'node2vec', 'hope', 'grarep', 'netmf', 'netsmf', 'prone'. More models can be found in the [cogdl/models](https://github.com/THUDM/cogdl/tree/master/cogdl/models).
 
-For example, if you want to run LINE, NetMF on Wikipedia with unsupervised node classification task, with 5 different seeds:
+For example, if you want to run GCN and GAT on the Cora dataset, with 5 different seeds:
 
 ```bash
-$ python scripts/train.py --task unsupervised_node_classification --dataset wikipedia --model line netmf --seed 0 1 2 3 4
+python scripts/train.py --dataset cora --model gcn gat --seed 0 1 2 3 4
 ```
 
 Expected output:
 
-| Variant                | Micro-F1 0.1   | Micro-F1 0.3   | Micro-F1 0.5   | Micro-F1 0.7   | Micro-F1 0.9   |
-|------------------------|----------------|----------------|----------------|----------------|----------------|
-| ('wikipedia', 'line')  | 0.4069±0.0011  | 0.4071±0.0010  | 0.4055±0.0013  | 0.4054±0.0020  | 0.4080±0.0042  |
-| ('wikipedia', 'netmf') | 0.4551±0.0024  | 0.4932±0.0022  | 0.5046±0.0017  | 0.5084±0.0057  | 0.5125±0.0035  |
+| Variant          | test_acc       | val_acc        |
+|------------------|----------------|----------------|
+| ('cora', 'gcn')  | 0.8050±0.0047  | 0.7940±0.0063  |
+| ('cora', 'gat')  | 0.8234±0.0042  | 0.8088±0.0016  |
 
 If you have any difficulty getting things working in the above steps, feel free to open an issue. You can expect a reply within 24 hours.
 
@@ -241,7 +238,7 @@ So how do you do a unit test?
 </details>
 
 ## CogDL Team
-CogDL is developed and maintained by [Tsinghua, BAAI, DAMO Academy, and ZHIPU.AI](https://cogdl.ai/about/). 
+CogDL is developed and maintained by [Tsinghua, ZJU, BAAI, DAMO Academy, and ZHIPU.AI](https://cogdl.ai/about/). 
 
 The core development team can be reached at [cogdlteam@gmail.com](mailto:cogdlteam@gmail.com).
 

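Review note on the renamed `search_space` callback in the README hunk above: the callback receives a trial object and returns a dict of hyper-parameters. The `suggest_categorical` and `suggest_uniform` calls look like Optuna's trial interface, but how `experiment` consumes the callback is not shown in this diff. The sketch below uses a hypothetical `FakeTrial` stand-in (not part of CogDL or Optuna) only to illustrate the shape of the value the callback returns.

```python
import random


class FakeTrial:
    """Hypothetical stand-in for the trial object passed to search_space."""

    def __init__(self, seed=0):
        self._rng = random.Random(seed)
        self.params = {}

    def suggest_categorical(self, name, choices):
        # Pick one of the listed choices and remember it under its name.
        value = self._rng.choice(choices)
        self.params[name] = value
        return value

    def suggest_uniform(self, name, low, high):
        # Draw a float between low and high and remember it under its name.
        value = self._rng.uniform(low, high)
        self.params[name] = value
        return value


def search_space(trial):
    # Same body as the README example in this diff.
    return {
        "lr": trial.suggest_categorical("lr", [1e-3, 5e-3, 1e-2]),
        "hidden_size": trial.suggest_categorical("hidden_size", [32, 64, 128]),
        "dropout": trial.suggest_uniform("dropout", 0.5, 0.8),
    }


sampled = search_space(FakeTrial(seed=0))
print(sorted(sampled))
```

Each call to `search_space` yields one candidate configuration; a real search would presumably call it once per trial and keep the configuration with the best validation score.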
diff --git a/README_CN.md b/README_CN.md
@@ -21,18 +21,20 @@ Features of CogDL include:
 
 ## ❗ News
 
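+- The new **v0.5.0b1 pre-release** designs a unified training loop for graph neural networks. This version removes the original `Task` class, introduces `DataWrapper` to prepare the data needed for training/validation/test, and introduces `ModelWrapper` to define the training/validation/test steps.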
+
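 - The new **v0.4.1 release** adds implementations of deep GNNs and the recommendation task. It also provides new pipelines for obtaining graph embeddings directly and for building recommendation applications. You are welcome to join our tutorial at KDD 2021 from 10:30 am to 12:00 pm on Aug. 14th (Beijing Time). More details can be found at https://kdd2021graph.github.io/. 🎉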
 
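 - The new **v0.4.0 release** refactors the underlying data storage (from the `Data` class to the `Graph` class) and provides more fast operators to speed up GNN training. It also includes many self-supervised learning algorithms on graphs. We are also glad to announce that we will give a CogDL tutorial at KDD 2021 in August. Please see [this link](https://kdd2021graph.github.io/) for details. 🎉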
 
-- CogDL supports GNN models with Mixture of Experts (MoE) modules. You can install [FastMoE](https://github.com/laekov/fastmoe) and then try the **[MoE GCN](./cogdl/models/nn/moe_gcn.py)** model in CogDL!
-
 <details>
 <summary>
 History
 </summary>
 <br/>
 
+- CogDL supports GNN models with Mixture of Experts (MoE) modules. You can install [FastMoE](https://github.com/laekov/fastmoe) and then try the **[MoE GCN](./cogdl/models/nn/moe_gcn.py)** model in CogDL!
+
 - The new **v0.3.0 release** provides a fast sparse matrix multiplication operator to speed up GNN training. We also released the first version of the **[CogDL paper](https://arxiv.org/abs/2103.00959)** on arXiv. You can join [our slack](https://join.slack.com/t/cogdl/shared_invite/zt-b9b4a49j-2aMB035qZKxvjV4vqf0hEg) to discuss CogDL. 🎉
 
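 - The new **v0.2.0 release** includes easy-to-use `experiment` and `pipeline` APIs; the `experiment` API also supports hyper-parameter search. This release also provides an API for the `OAGBert` model (`OAGBert` is a model from our lab trained on a large-scale academic corpus). Many features in this release were contributed by the open source community; thanks to everyone for the support! 🎉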
@@ -47,7 +49,7 @@ Features of CogDL include:
 
 ### System Requirements
 
-- Python version >= 3.6
+- Python version >= 3.7
 - PyTorch version >= 1.7.1
 
 Please follow the link below to install PyTorch (https://github.com/pytorch/pytorch#installation).
@@ -81,34 +83,30 @@ pip install -e .
 from cogdl import experiment
 
 # basic usage
-experiment(task="node_classification", dataset="cora", model="gcn")
+experiment(dataset="cora", model="gcn")
 
 # set other hyper-parameters
-experiment(task="node_classification", dataset="cora", model="gcn", hidden_size=32, max_epoch=200)
+experiment(dataset="cora", model="gcn", hidden_size=32, max_epoch=200)
 
 # run over multiple models on different seeds
-experiment(task="node_classification", dataset="cora", model=["gcn", "gat"], seed=[1, 2])
+experiment(dataset="cora", model=["gcn", "gat"], seed=[1, 2])
 
 # automl usage
-def func_search(trial):
+def search_space(trial):
     return {
         "lr": trial.suggest_categorical("lr", [1e-3, 5e-3, 1e-2]),
         "hidden_size": trial.suggest_categorical("hidden_size", [32, 64, 128]),
         "dropout": trial.suggest_uniform("dropout", 0.5, 0.8),
     }
 
-experiment(task="node_classification", dataset="cora", model="gcn", seed=[1, 2], func_search=func_search)
+experiment(dataset="cora", model="gcn", seed=[1, 2], search_space=search_space)
 ```
 
 You can also run some interesting applications through the `pipeline` API. The example below can be found in [pipeline.py](https://github.com/THUDM/cogdl/tree/master/examples/pipeline.py).
 
 ```python
 from cogdl import pipeline
 
-# print the statistics of datasets
-stats = pipeline("dataset-stats")
-stats(["cora", "citeseer"])
-
 # load OAGBert model and perform inference
 oagbert = pipeline("oagbert")
 outputs = oagbert(["CogDL is developed by KEG, Tsinghua.", "OAGBert is developed by KEG, Tsinghua."])
@@ -117,24 +115,23 @@ outputs = oagbert(["CogDL is developed by KEG, Tsinghua.", "OAGBert is developed
 More details of the OAGBert usage can be found [here](./cogdl/oag/README.md).
 
 ### Command-Line Usage
-For basic usage, run `python train.py --task example_task --dataset example_dataset --model example_model` to run `example_model` on `example_data` and evaluate the results with `example_task`.
+For basic usage, run `python train.py --dataset example_dataset --model example_model` to run `example_model` on `example_dataset`.
 
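-- --task, the name of the task to run: a downstream task used to evaluate model performance, such as `node_classification`, `unsupervised_node_classification`, or `graph_classification`.
 - --dataset, the name of the dataset to run; can be a space-separated list of dataset names. Currently supported datasets include cora, citeseer, pubmed, ppi, wikipedia, blogcatalog, dblp, flickr, etc.
 - --model, the name of the model to run; can be a list. Supported models include gcn, gat, deepwalk, node2vec, hope, grarep, netmf, netsmf, prone, etc.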
 
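-If you want to run the LINE and NetMF models on the Wikipedia dataset with 5 different random seeds, you can use the following command:
+If you want to run the GCN and GAT models on the Cora dataset with 5 different random seeds, you can use the following command: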
 
 ```bash
-$ python scripts/train.py --task unsupervised_node_classification --dataset wikipedia --model line netmf --seed 0 1 2 3 4
+python scripts/train.py --dataset cora --model gcn gat --seed 0 1 2 3 4
 ```
 
 Expected output:
 
-| Variant                | Micro-F1 0.1   | Micro-F1 0.3   | Micro-F1 0.5   | Micro-F1 0.7   | Micro-F1 0.9   |
-|------------------------|----------------|----------------|----------------|----------------|----------------|
-| ('wikipedia', 'line')  | 0.4069±0.0011  | 0.4071±0.0010  | 0.4055±0.0013  | 0.4054±0.0020  | 0.4080±0.0042  |
-| ('wikipedia', 'netmf') | 0.4551±0.0024  | 0.4932±0.0022  | 0.5046±0.0017  | 0.5084±0.0057  | 0.5125±0.0035  |
+| Variant          | test_acc       | val_acc        |
+|------------------|----------------|----------------|
+| ('cora', 'gcn')  | 0.8050±0.0047  | 0.7940±0.0063  |
+| ('cora', 'gat')  | 0.8234±0.0042  | 0.8088±0.0016  |
 
 If you run into any difficulty with our toolkit or with custom steps, feel free to open a GitHub issue or leave a comment. You can expect a reply within 24 hours.
 
@@ -223,7 +220,7 @@ git clone https://github.com/THUDM/cogdl /cogdl
 </details>
 
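 ## CogDL Team
-CogDL is developed and maintained by [Tsinghua, BAAI, DAMO Academy, and ZHIPU.AI](https://cogdl.ai/zh/about/).
+CogDL is developed and maintained by [Tsinghua University, Zhejiang University, BAAI, DAMO Academy, and ZHIPU.AI](https://cogdl.ai/zh/about/).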
 
 The CogDL core development team can be reached at [cogdlteam@gmail.com](mailto:cogdlteam@gmail.com).
 

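Review note on the command-line example in the README hunks above: passing several models and seeds produces one table row per (dataset, model) pair, with cells such as `0.8050±0.0047`. Assuming each cell is the mean and standard deviation over seeds (the diff does not say which estimator is used), the aggregation can be sketched as follows. `run_once` is a hypothetical placeholder that returns dummy scores, not real accuracies, and `run_grid` is an illustrative name, not a CogDL function.

```python
import itertools
import statistics


def run_once(dataset, model, seed):
    """Hypothetical placeholder for one training run.

    Returns a deterministic dummy score, not a real accuracy.
    """
    base = {"gcn": 0.80, "gat": 0.82}[model]
    return base + 0.001 * seed


def run_grid(datasets, models, seeds):
    """Run every (dataset, model) pair over all seeds and format mean±std."""
    rows = {}
    for dataset, model in itertools.product(datasets, models):
        scores = [run_once(dataset, model, seed) for seed in seeds]
        mean = statistics.mean(scores)
        std = statistics.pstdev(scores)  # population std; an assumption
        rows[(dataset, model)] = f"{mean:.4f}±{std:.4f}"
    return rows


table = run_grid(["cora"], ["gcn", "gat"], [0, 1, 2, 3, 4])
for variant, cell in table.items():
    print(variant, cell)
```

The dictionary keys mirror the `Variant` column of the README table, so adding a dataset or model to the lists adds rows without changing the aggregation.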
diff --git a/cogdl/__init__.py b/cogdl/__init__.py
@@ -1,4 +1,4 @@
-__version__ = "0.4.1"
+__version__ = "0.5.0b1"
 
 from .experiments import experiment
 from .oag import oagbert

diff --git a/cogdl/data/__init__.py b/cogdl/data/__init__.py
@@ -1,6 +1,6 @@
 from .data import Graph, Adjacency
-from .batch import Batch
+from .batch import Batch, batch_graphs
 from .dataset import Dataset, MultiGraphDataset
 from .dataloader import DataLoader
 
-__all__ = ["Graph", "Adjacency", "Batch", "Dataset", "DataLoader", "MultiGraphDataset"]
+__all__ = ["Graph", "Adjacency", "Batch", "Dataset", "DataLoader", "MultiGraphDataset", "batch_graphs"]
diff --git a/cogdl/data/batch.py b/cogdl/data/batch.py
@@ -4,6 +4,10 @@
 from cogdl.data import Graph, Adjacency
 
 
+def batch_graphs(graphs):
+    return Batch.from_data_list(graphs, class_type=Graph)
+
+
 class Batch(Graph):
     r"""A plain old python object modeling a batch of graphs as one big
     (disconnected) graph. With :class:`cogdl.data.Data` being the
@@ -19,7 +23,7 @@ def __init__(self, batch=None, **kwargs):
         self.__slices__ = None
 
     @staticmethod
-    def from_data_list(data_list):
+    def from_data_list(data_list, class_type=None):
         r"""Constructs a batch object from a python list holding
         :class:`cogdl.data.Data` objects.
         The assignment vector :obj:`batch` is created on the fly.
@@ -31,8 +35,11 @@ def from_data_list(data_list):
         keys = list(set.union(*keys))
         assert "batch" not in keys
 
-        batch = Batch()
-        batch.__data_class__ = data_list[0].__class__
+        if class_type is not None:
+            batch = class_type()
+        else:
+            batch = Batch()
+            batch.__data_class__ = data_list[0].__class__
         batch.__slices__ = {key: [0] for key in keys}
 
         for key in keys:
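Review note on the `class_type` argument added above: it lets `from_data_list` build the merged result as a plain `Graph` instead of a `Batch`, which is what the new `batch_graphs` helper relies on. The sketch below reproduces only that dispatch pattern with toy classes. `ToyGraph` and `ToyBatch` are hypothetical stand-ins, not CogDL's classes, and the merge step simply concatenates list attributes, assuming every graph carries a node attribute `x`.

```python
class ToyGraph:
    """Hypothetical stand-in for a graph container holding list attributes."""

    def __init__(self, **attrs):
        for key, value in attrs.items():
            setattr(self, key, value)


class ToyBatch(ToyGraph):
    """Hypothetical stand-in for a batch of graphs merged into one."""

    @staticmethod
    def from_data_list(data_list, class_type=None):
        keys = sorted(set().union(*(vars(d).keys() for d in data_list)))
        assert "batch" not in keys

        # Pick the container type: the caller's class, or ToyBatch by default.
        out = class_type() if class_type is not None else ToyBatch()

        # Concatenate every attribute across the input graphs.
        for key in keys:
            merged = []
            for d in data_list:
                merged.extend(getattr(d, key))
            setattr(out, key, merged)

        # Assignment vector: index of the source graph for every node.
        out.batch = [i for i, d in enumerate(data_list) for _ in d.x]
        return out


def batch_graphs(graphs):
    # Mirrors the helper in the diff: same merge, but a plain graph comes back.
    return ToyBatch.from_data_list(graphs, class_type=ToyGraph)


g1 = ToyGraph(x=[1, 2])
g2 = ToyGraph(x=[3])
merged = batch_graphs([g1, g2])
default = ToyBatch.from_data_list([g1, g2])
print(type(merged).__name__, type(default).__name__, merged.x, merged.batch)
```

Under this pattern the default call path is unchanged, while callers that need a plain graph object opt in through `class_type`.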