# **CS224W - Colab 2**

In this Colab, we will build our own graph neural networks with PyTorch Geometric (PyG) and apply them to two Open Graph Benchmark (OGB) datasets. The two datasets benchmark model performance on two different graph-related tasks:
- Node property prediction: predicting properties of individual nodes.
- Graph property prediction: predicting properties of entire graphs or subgraphs.

First, we will learn how PyTorch Geometric stores graphs as PyTorch tensors.

We will then load and take a quick look at one of the OGB datasets using the `ogb` package. OGB is a collection of realistic, large-scale, and diverse benchmark datasets for machine learning on graphs. The `ogb` package provides not only data loaders for the datasets but also evaluators.

Finally, we will build our own graph neural networks with PyTorch Geometric, then apply and evaluate the models on node property prediction and graph property prediction tasks.

**Note**: Make sure to **sequentially run all the cells in each section**, so that the intermediate variables / packages carry over to the next cell.

Have fun on Colab 2 :)

# Device
This Colab is fairly compute-heavy, so a GPU is recommended.

Please click `Runtime` and then `Change runtime type`. Then set the `hardware accelerator` to **GPU**.

# Installation

In [1]:
!pip install -q torch-scatter -f https://pytorch-geometric.com/whl/torch-1.7.0+cu101.html
!pip install -q torch-sparse -f https://pytorch-geometric.com/whl/torch-1.7.0+cu101.html
!pip install -q torch-geometric
!pip install ogb



# 1 PyTorch Geometric (Datasets and Data)

PyTorch Geometric provides two modules for storing graphs and converting them to tensor format. One is `torch_geometric.datasets`, which contains a variety of common graph datasets. The other is `torch_geometric.data`, which provides the data handling of graphs in PyTorch tensors.

In this section, we will learn how to use `torch_geometric.datasets` and `torch_geometric.data`.

## PyG Datasets
`torch_geometric.datasets` contains many common graph datasets. Here we explore its usage with one example dataset.

In [2]:
from torch_geometric.datasets import TUDataset

root = './enzymes'
name = 'ENZYMES'

# The ENZYMES dataset
pyg_dataset = TUDataset(root, name)

# You can find that there are 600 graphs in this dataset
print(pyg_dataset)

ENZYMES(600)


## Question 1: What is the number of classes and number of features in the ENZYMES dataset? (5 points)

Code reference: https://pytorch-geometric.readthedocs.io/en/latest/notes/introduction.html#common-benchmark-datasets

In [3]:
def get_num_classes(pyg_dataset):
    # TODO: Implement this function that takes a PyG dataset object
    # and returns the number of classes for that dataset.
    num_classes = 0

    ############# Your code here ############
    ## (~1 line of code)
    ## Note
    ## 1. Colab autocomplete functionality might be useful.
    num_classes = pyg_dataset.num_classes
    #########################################

    return num_classes

def get_num_features(pyg_dataset):
    # TODO: Implement this function that takes a PyG dataset object
    # and returns the number of features for that dataset.
    # Reference:
    # https://pytorch-geometric.readthedocs.io/en/latest/modules/data.html#torch_geometric.data.Dataset.num_node_features
    num_features = 0

    ############# Your code here ############
    ## (~1 line of code)
    ## Note
    ## 1. Colab autocomplete functionality might be useful.
    num_features = pyg_dataset.num_node_features
    #########################################

    return num_features

# You may find that some information needs to be stored at the dataset level,
# specifically if there are multiple graphs in the dataset

num_classes = get_num_classes(pyg_dataset)
num_features = get_num_features(pyg_dataset)
print("{} dataset has {} classes".format(name, num_classes))
print("{} dataset has {} features".format(name, num_features))

ENZYMES dataset has 6 classes
ENZYMES dataset has 3 features


## PyG Data
Each PyG dataset usually stores a list of `torch_geometric.data.Data` objects, and each `Data` object usually represents one graph. You can get a `Data` object by indexing into the dataset.

For more information, such as what is stored in a `Data` object, please refer to the [documentation](https://pytorch-geometric.readthedocs.io/en/latest/modules/data.html#torch_geometric.data.Data).
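To make the storage format concrete, here is a minimal plain-Python sketch (no PyG required) of how an undirected graph maps onto the `[2, num_edges]` layout of `edge_index`. The helper name `to_edge_index` and the toy graph are assumptions for illustration; PyG itself keeps these two rows in a tensor rather than in Python lists.

```python
def to_edge_index(undirected_edges):
    """Return [sources, targets] with every undirected edge listed in both directions."""
    sources, targets = [], []
    for u, v in undirected_edges:
        # An undirected edge {u, v} becomes two directed entries: u->v and v->u
        sources += [u, v]
        targets += [v, u]
    return [sources, targets]

# Toy path graph 0 - 1 - 2 (hypothetical data)
edge_index = to_edge_index([(0, 1), (1, 2)])
print(edge_index)          # [[0, 1, 1, 2], [1, 0, 2, 1]]
print(len(edge_index[0]))  # 4 directed entries for 2 undirected edges
```

This is why a graph reported with 168 entries in `edge_index` has 84 undirected edges.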

In [4]:
# A quick look at what a single graph contains
print("Graph:", pyg_dataset[0])
print("Number of nodes:", pyg_dataset[0].num_nodes)
print("edge_index shape:", (pyg_dataset[0].edge_index).shape)
print("Number of edges:", pyg_dataset[0].num_edges)
print("Node features:", pyg_dataset[0].x)
print("Graph label:", pyg_dataset[0].y)

# If every undirected edge is stored twice, num_edges is always even
for graph in pyg_dataset:
    if graph.num_edges % 2 == 1:
        print("Found a graph with an odd edge count, so not every edge is stored twice")
        break
else:
    print("Every graph has an even edge count, consistent with PyG storing each undirected edge twice")

Graph: Data(edge_index=[2, 168], x=[37, 3], y=[1])
Number of nodes: 37
edge_index shape: torch.Size([2, 168])
Number of edges: 168
Node features: tensor([[1., 0., 0.],
        [1., 0., 0.],
        [1., 0., 0.],
        [1., 0., 0.],
        [1., 0., 0.],
        [1., 0., 0.],
        [1., 0., 0.],
        [1., 0., 0.],
        [1., 0., 0.],
        [1., 0., 0.],
        [1., 0., 0.],
        [1., 0., 0.],
        [1., 0., 0.],
        [1., 0., 0.],
        [1., 0., 0.],
        [1., 0., 0.],
        [1., 0., 0.],
        [1., 0., 0.],
        [1., 0., 0.],
        [1., 0., 0.],
        [1., 0., 0.],
        [1., 0., 0.],
        [1., 0., 0.],
        [1., 0., 0.],
        [0., 1., 0.],
        [0., 1., 0.],
        [0., 1., 0.],
        [0., 1., 0.],
        [0., 1., 0.],
        [0., 1., 0.],
        [0., 1., 0.],
        [0., 1., 0.],
        [0., 1., 0.],
        [0., 1., 0.],
        [0., 1., 0.],
        [0., 1., 0.],
        [0., 1., 0.]])
Graph label: tensor([5])
Every graph has an even edge count, consistent with PyG storing each undirected edge twice


## Question 2: What is the label of the graph (index 100 in the ENZYMES dataset)? (5 points)

In [5]:
def get_graph_class(pyg_dataset, idx):
    # TODO: Implement this function that takes a PyG dataset object,
    # the index of the graph in the dataset, and returns the class/label
    # of the graph (as an integer).
    label = -1

    ############# Your code here ############
    ## (~1 line of code)
    ## Note
    ## 1. Call .item() so that a plain Python integer is returned.
    label = pyg_dataset[idx].y.item()
    #########################################

    return label

# Here pyg_dataset is a dataset for graph classification
graph_0 = pyg_dataset[0]
print(graph_0)
idx = 100
label = get_graph_class(pyg_dataset, idx)
print('Graph with index {} has label {}'.format(idx, label))

Data(edge_index=[2, 168], x=[37, 3], y=[1])
Graph with index 100 has label 4


## Question 3: What is the number of edges for the graph (index 200 in the ENZYMES dataset)? (5 points)

Note that PyG stores each edge of an undirected graph twice, once per direction.

In [6]:
def get_graph_num_edges(pyg_dataset, idx):
    # TODO: Implement this function that takes a PyG dataset object,
    # the index of the graph in the dataset, and returns the number of
    # edges in the graph (as an integer). You should not count an edge
    # twice if the graph is undirected. For example, in an undirected
    # graph G, if two nodes v and u are connected by an edge, this edge
    # should only be counted once.
    num_edges = 0

    ############# Your code here ############
    ## Note:
    ## 1. You can't return the data.num_edges directly
    ## 2. We assume the graph is undirected
    ## (~4 lines of code)
    # Each undirected edge appears twice in edge_index, once per direction
    num_edges = pyg_dataset[idx].edge_index.shape[1] // 2
    #########################################

    return num_edges

idx = 200
num_edges = get_graph_num_edges(pyg_dataset, idx)
print('Graph with index {} has {} edges'.format(idx, num_edges))

Graph with index 200 has 53 edges
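
Halving the column count works here because each undirected edge appears exactly twice. A slightly more defensive sketch, shown in plain Python on a hypothetical `edge_index`, counts unique unordered pairs instead, which also stays correct if an edge is stored only once or if self-loops are present. The function name `count_undirected_edges` is an assumption for illustration, not part of PyG.

```python
def count_undirected_edges(edge_index):
    """Count unique undirected edges in a [sources, targets] pair of lists."""
    sources, targets = edge_index
    # Sorting each pair maps (u, v) and (v, u) onto the same key
    unique_pairs = {tuple(sorted(pair)) for pair in zip(sources, targets)}
    return len(unique_pairs)

# Path graph 0 - 1 - 2 stored in both directions, plus a self-loop on node 2
toy_edge_index = [[0, 1, 1, 2, 2], [1, 0, 2, 1, 2]]
print(count_undirected_edges(toy_edge_index))  # 3
```

With a real PyG `Data` object, `data.edge_index.tolist()` should yield the same two-row list form.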


# 2 Open Graph Benchmark (OGB)

The Open Graph Benchmark (OGB) is a collection of realistic, large-scale, and diverse benchmark datasets for machine learning on graphs. Its datasets are automatically downloaded, processed, and split using the OGB Data Loader. Model performance can also be evaluated in a unified manner with the OGB Evaluator.

## Dataset and Data
OGB also supports the PyG dataset and data classes. Here we take a look at the `ogbn-arxiv` dataset.

In [7]:
import torch_geometric.transforms as T
from ogb.nodeproppred import PygNodePropPredDataset

dataset_name = 'ogbn-arxiv'
# Load the dataset and transform it to sparse tensor
dataset = PygNodePropPredDataset(name=dataset_name,
                                 transform=T.ToSparseTensor())
print('The {} dataset has {} graph'.format(dataset_name, len(dataset)))

# Extract the graph
data = dataset[0]
print(data)

The ogbn-arxiv dataset has 1 graph
Data(adj_t=[169343, 169343, nnz=1166243], node_year=[169343, 1], x=[169343, 128], y=[169343, 1])


## Question 4: What is the number of features in the ogbn-arxiv graph? (5 points)

In [8]:
def graph_num_features(data):
    # TODO: Implement this function that takes a PyG data object,
    # and returns the number of features in the graph (as an integer).
    num_features = 0

    ############# Your code here ############
    ## (~1 line of code)
    num_features = data.num_features
    #########################################

    return num_features

num_features = graph_num_features(data)
print('The graph has {} features'.format(num_features))

The graph has 128 features


# 3 GNN: Node Property Prediction

In this section we will build our first graph neural network with PyTorch Geometric and apply it to node property prediction (node classification).

We will build the graph neural network using the GCN operator ([Kipf et al. (2017)](https://arxiv.org/pdf/1609.02907.pdf)).

You should use the PyG built-in `GCNConv` layer directly.
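
For intuition, the propagation step of a GCN layer in Kipf et al. mixes each node's features with those of its neighbors after adding self-loops, weighting the message from node j to node i by 1 / sqrt(d_i * d_j), where d counts the self-loop. The plain-Python sketch below applies that aggregation to scalar features on a toy path graph. It omits the learnable weight matrix and bias, and the function name `gcn_aggregate` is an assumption for illustration, not the `GCNConv` implementation.

```python
import math

def gcn_aggregate(num_nodes, undirected_edges, x):
    """One GCN propagation step on scalar features, without weights or bias."""
    # Neighbor sets with self-loops added
    neighbors = {i: {i} for i in range(num_nodes)}
    for u, v in undirected_edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    degree = {i: len(neighbors[i]) for i in range(num_nodes)}  # includes the self-loop
    out = []
    for i in range(num_nodes):
        # Symmetric normalization: message j -> i is scaled by 1 / sqrt(d_i * d_j)
        out.append(sum(x[j] / math.sqrt(degree[i] * degree[j]) for j in neighbors[i]))
    return out

# Path graph 0 - 1 - 2 with features 1, 2, 3 (hypothetical data)
h = gcn_aggregate(3, [(0, 1), (1, 2)], [1.0, 2.0, 3.0])
print([round(v, 4) for v in h])
```

In the real layer, this aggregation is combined with a learned linear transform, which is what `in_channels` and `out_channels` configure.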

## Setup

In [9]:
import torch
import torch.nn.functional as F
print(torch.__version__)

# The PyG built-in GCNConv
from torch_geometric.nn import GCNConv

import torch_geometric.transforms as T
from ogb.nodeproppred import PygNodePropPredDataset, Evaluator

1.7.1


## Load and Preprocess the Dataset

In [29]:
dataset_name = 'ogbn-arxiv'
dataset = PygNodePropPredDataset(name=dataset_name,
                                 transform=T.ToSparseTensor())
data = dataset[0]

# Make the adjacency matrix symmetric
data.adj_t = data.adj_t.to_symmetric()

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# If you use GPU, the device should be cuda
print('Device: {}'.format(device))

data = data.to(device)
split_idx = dataset.get_idx_split()
train_idx = split_idx['train'].to(device)

Device: cuda
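
`ogbn-arxiv` is a directed citation graph, so the preprocessing above adds the reverse of every edge before the GCN is applied. The effect can be sketched in plain Python on a set of `(source, target)` pairs. The `symmetrize` helper and the toy edges are assumptions for illustration and do not mirror the sparse-tensor implementation.

```python
def symmetrize(directed_edges):
    """Return the edge set closed under reversal: (u, v) present implies (v, u) present."""
    edges = set(directed_edges)
    edges |= {(v, u) for u, v in directed_edges}
    return edges

# Paper 0 cites paper 1, paper 2 cites paper 1 (hypothetical data)
citations = [(0, 1), (2, 1)]
sym_edges = symmetrize(citations)
print(sorted(sym_edges))  # [(0, 1), (1, 0), (1, 2), (2, 1)]
```

After this step a message can flow both from a citing paper to the cited one and back.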


## GCN Model
Now we will implement our GCN model!

The assignment asks for `torch.nn.ModuleList`, so the first code cell below reproduces its usage example from the official PyTorch documentation.

Please follow the figure below to implement your `forward` function.


![picture](img/cs224w-colab2-3.png)

In [39]:
class MyModule(torch.nn.Module):
    def __init__(self):
        super(MyModule, self).__init__()
        self.linears = torch.nn.ModuleList([torch.nn.Linear(10, 10) for i in range(10)])

    def forward(self, x):
        # ModuleList can act as an iterable, or be indexed using ints
        for i, l in enumerate(self.linears):
            x = self.linears[i // 2](x) + l(x)  # // is integer division
        return x

Modulelist = MyModule()
Modulelist

MyModule(
  (linears): ModuleList(
    (0): Linear(in_features=10, out_features=10, bias=True)
    (1): Linear(in_features=10, out_features=10, bias=True)
    (2): Linear(in_features=10, out_features=10, bias=True)
    (3): Linear(in_features=10, out_features=10, bias=True)
    (4): Linear(in_features=10, out_features=10, bias=True)
    (5): Linear(in_features=10, out_features=10, bias=True)
    (6): Linear(in_features=10, out_features=10, bias=True)
    (7): Linear(in_features=10, out_features=10, bias=True)
    (8): Linear(in_features=10, out_features=10, bias=True)
    (9): Linear(in_features=10, out_features=10, bias=True)
  )
)

In [30]:
class GCN(torch.nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim, num_layers,
                 dropout, return_embeds=False):
        # TODO: Implement this function that initializes self.convs,
        # self.bns, and self.softmax.
        super(GCN, self).__init__()

        # A list of GCNConv layers
        self.convs = None

        # A list of 1D batch normalization layers
        self.bns = None

        # The log softmax layer
        self.softmax = None

        ############# Your code here ############
        ## Note:
        ## 1. You should use torch.nn.ModuleList for self.convs and self.bns
        ## 2. self.convs has num_layers GCNConv layers
        ## 3. self.bns has num_layers - 1 BatchNorm1d layers
        ## 4. You should use torch.nn.LogSoftmax for self.softmax
        ## 5. The parameters you can set for GCNConv include 'in_channels' and 
        ## 'out_channels'. More information please refer to the documentation:
        ## https://pytorch-geometric.readthedocs.io/en/latest/modules/nn.html#torch_geometric.nn.conv.GCNConv
        ## 6. The only parameter you need to set for BatchNorm1d is 'num_features'
        ## More information please refer to the documentation: 
        ## https://pytorch.org/docs/stable/generated/torch.nn.BatchNorm1d.html
        ## (~10 lines of code)
        self.convs = torch.nn.ModuleList()
        self.convs.append(GCNConv(input_dim, hidden_dim))
        for _ in range(num_layers - 2):
            self.convs.append(GCNConv(hidden_dim, hidden_dim))
        self.convs.append(GCNConv(hidden_dim, output_dim))

        self.bns = torch.nn.ModuleList(
            [torch.nn.BatchNorm1d(hidden_dim) for _ in range(num_layers - 1)])

        # Pass dim explicitly; omitting it triggers an implicit-dimension warning
        self.softmax = torch.nn.LogSoftmax(dim=-1)

        #########################################

        # Probability of an element to be zeroed
        self.dropout = dropout

        # Skip classification layer and return node embeddings
        self.return_embeds = return_embeds

    def reset_parameters(self):
        for conv in self.convs:
            conv.reset_parameters()
        for bn in self.bns:
            bn.reset_parameters()

    def forward(self, x, adj_t):
        # TODO: Implement this function that takes the feature tensor x,
        # edge_index tensor adj_t and returns the output tensor as
        # shown in the figure.
        out = None

        ############# Your code here ############
        ## Note:
        ## 1. Construct the network as showing in the figure
        ## 2. torch.nn.functional.relu and torch.nn.functional.dropout are useful
        ## More information please refer to the documentation:
        ## https://pytorch.org/docs/stable/nn.functional.html
        ## 3. Don't forget to set F.dropout training to self.training
        ## 4. If return_embeds is True, then skip the last softmax layer
        ## (~7 lines of code)
        for i in range(len(self.convs) - 1):  # every layer except the output layer
            x = self.convs[i](x, adj_t)
            
            x = self.bns[i](x)
            x = F.relu(x)
            x = F.dropout(x, self.dropout, self.training)
        
        out = self.convs[-1](x, adj_t)
        if not self.return_embeds:
            out = self.softmax(out)
        #########################################

        return out

In [31]:
def train(model, data, train_idx, optimizer, loss_fn):
    # TODO: Implement this function that trains the model by
    # using the given optimizer and loss_fn.
    model.train()
    loss = 0

    ############# Your code here ############
    ## Note:
    ## 1. Zero grad the optimizer
    ## 2. Feed the data into the model
    ## 3. Slicing the model output and label by train_idx
    ## 4. Feed the sliced output and label to loss_fn
    ## (~4 lines of code)
    optimizer.zero_grad()
    out = model(data.x, data.adj_t)
    train_output = out[train_idx]
    # data.y has shape [num_nodes, 1]; the loss expects 1-D class indices
    train_label = data.y[train_idx, 0]
    loss = loss_fn(train_output, train_label)
    #########################################

    loss.backward()
    optimizer.step()

    return loss.item()
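
The model ends in a log-softmax layer, and the training loop later in this section passes `F.nll_loss` as `loss_fn`; together the two compute the usual cross-entropy. A plain-Python sketch for a single node, with made-up logits, shows the arithmetic. The helper names `log_softmax` and `nll` are assumptions for illustration.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_sum_exp for z in logits]

def nll(log_probs, label):
    """Negative log-likelihood of the true class."""
    return -log_probs[label]

logits = [2.0, 0.5, -1.0]   # hypothetical scores for 3 classes
log_probs = log_softmax(logits)
loss = nll(log_probs, label=0)
print(round(loss, 4))
```

Because the exponentiated log-probabilities sum to 1, the loss is small only when the true class receives most of the probability mass.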

In [32]:
# Test function here

@torch.no_grad()
def test(model, data, split_idx, evaluator):
    # TODO: Implement this function that tests the model by
    # using the given split_idx and evaluator.
    # Put the model in evaluation mode
    model.eval()

    # The output of model on all data
    out = None

    ############# Your code here ############
    ## (~1 line of code)
    ## Note:
    ## 1. No index slicing here
    out = model(data.x, data.adj_t)
    #########################################

    y_pred = out.argmax(dim=-1, keepdim=True)

    train_acc = evaluator.eval({
        'y_true': data.y[split_idx['train']],
        'y_pred': y_pred[split_idx['train']],
    })['acc']
    valid_acc = evaluator.eval({
        'y_true': data.y[split_idx['valid']],
        'y_pred': y_pred[split_idx['valid']],
    })['acc']
    test_acc = evaluator.eval({
        'y_true': data.y[split_idx['test']],
        'y_pred': y_pred[split_idx['test']],
    })['acc']

    return train_acc, valid_acc, test_acc

Aside: so what exactly is the `evaluator`?

In [33]:
from ogb.nodeproppred import PygNodePropPredDataset, Evaluator
evaluator = Evaluator(name='ogbn-arxiv')
print(evaluator.expected_output_format) 

==== Expected output format of Evaluator for ogbn-arxiv
{'acc': acc}
- acc (float): Accuracy score averaged across 1 task(s)
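
For this dataset the reported metric is plain accuracy: the fraction of nodes whose predicted class equals the true class. Below is a plain-Python sketch of the argmax-then-compare step, with hypothetical scores and labels and an assumed helper name `accuracy`. It is not the OGB implementation.

```python
def accuracy(scores, labels):
    """Fraction of rows whose highest-scoring class matches the label."""
    correct = 0
    for row, label in zip(scores, labels):
        predicted = max(range(len(row)), key=row.__getitem__)  # argmax
        correct += int(predicted == label)
    return correct / len(labels)

scores = [[0.1, 0.9], [0.8, 0.2], [0.3, 0.7], [0.6, 0.4]]  # hypothetical model outputs
labels = [1, 0, 0, 0]
acc = accuracy(scores, labels)
print(acc)  # 0.75
```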



In [34]:
# Please do not change the args
args = {
    'device': device,
    'num_layers': 3,
    'hidden_dim': 256,
    'dropout': 0.5,
    'lr': 0.01,
    'epochs': 100,
}
args

{'device': 'cuda',
 'num_layers': 3,
 'hidden_dim': 256,
 'dropout': 0.5,
 'lr': 0.01,
 'epochs': 100}

In [35]:
model = GCN(data.num_features, args['hidden_dim'],
            dataset.num_classes, args['num_layers'],
            args['dropout']).to(device)
evaluator = Evaluator(name='ogbn-arxiv')

In [36]:
import copy

# Reset the parameters to initial random values
model.reset_parameters()

optimizer = torch.optim.Adam(model.parameters(), lr=args['lr'])
loss_fn = F.nll_loss

best_model = None
best_valid_acc = 0

for epoch in range(1, 1 + args["epochs"]):
    loss = train(model, data, train_idx, optimizer, loss_fn)
    result = test(model, data, split_idx, evaluator)
    train_acc, valid_acc, test_acc = result
    if valid_acc > best_valid_acc:
        best_valid_acc = valid_acc
        best_model = copy.deepcopy(model)
    print(f'Epoch: {epoch:02d}, '
          f'Loss: {loss:.4f}, '
          f'Train: {100 * train_acc:.2f}%, '
          f'Valid: {100 * valid_acc:.2f}% '
          f'Test: {100 * test_acc:.2f}%')


Epoch: 01, Loss: 4.1763, Train: 11.09%, Valid: 22.98% Test: 21.57%
Epoch: 02, Loss: 2.2836, Train: 22.72%, Valid: 21.46% Test: 26.69%
Epoch: 03, Loss: 1.9543, Train: 36.72%, Valid: 42.47% Test: 40.82%
Epoch: 04, Loss: 1.7791, Train: 31.18%, Valid: 28.31% Test: 29.25%
Epoch: 05, Loss: 1.6504, Train: 32.05%, Valid: 25.36% Test: 28.44%
Epoch: 06, Loss: 1.5590, Train: 33.33%, Valid: 26.89% Test: 30.07%
Epoch: 07, Loss: 1.4938, Train: 35.14%, Valid: 31.43% Test: 34.76%
Epoch: 08, Loss: 1.4353, Train: 37.29%, Valid: 36.55% Test: 39.09%
Epoch: 09, Loss: 1.3971, Train: 37.01%, Valid: 34.89% Test: 37.50%
Epoch: 10, Loss: 1.3731, Train: 36.42%, Valid: 32.26% Test: 35.76%
Epoch: 11, Loss: 1.3375, Train: 36.61%, Valid: 30.84% Test: 35.00%
Epoch: 12, Loss: 1.3091, Train: 37.33%, Valid: 31.22% Test: 35.67%
Epoch: 13, Loss: 1.2815, Train: 39.05%, Valid: 34.70% Test: 39.63%
Epoch: 14, Loss: 1.2674, Train: 42.16%, Valid: 40.46% Test: 45.45%
Epoch: 15, Loss: 1.2433, Train: 45.28%, Valid: 45.73% Test: 50

In [37]:
best_result = test(best_model, data, split_idx, evaluator)
train_acc, valid_acc, test_acc = best_result
print(f'Best model: '
      f'Train: {100 * train_acc:.2f}%, '
      f'Valid: {100 * valid_acc:.2f}% '
      f'Test: {100 * test_acc:.2f}%')

Best model: Train: 73.42%, Valid: 71.83% Test: 70.83%



## Question 5: What are your `best_model` validation and test accuracy? Please report them on Gradescope. For example, for an accuracy such as 50.01%, just report 50.01 and please don't include the percent sign. (20 points)

# 4 GNN: Graph Property Prediction

In this section we will create a graph neural network for graph property prediction (graph classification).


## Load and preprocess the dataset

Tip: `dataset.task_type` tells you the task type directly.

In [38]:
from ogb.graphproppred import PygGraphPropPredDataset, Evaluator
from torch_geometric.data import DataLoader
from tqdm.notebook import tqdm

# Load the dataset 
dataset = PygGraphPropPredDataset(name='ogbg-molhiv')

device = 'cuda' if torch.cuda.is_available() else 'cpu'
print('Device: {}'.format(device))

split_idx = dataset.get_idx_split()

# Check task type
print('Task type: {}'.format(dataset.task_type))

Device: cuda
Task type: binary classification


In [19]:
# Load the data sets into dataloader
# We will train the graph classification task on a batch of 32 graphs
# Shuffle the order of graphs for training set
train_loader = DataLoader(dataset[split_idx["train"]], batch_size=32, shuffle=True, num_workers=0)
valid_loader = DataLoader(dataset[split_idx["valid"]], batch_size=32, shuffle=False, num_workers=0)
test_loader = DataLoader(dataset[split_idx["test"]], batch_size=32, shuffle=False, num_workers=0)
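
PyG's `DataLoader` merges the graphs of a mini-batch into one large disconnected graph: edge indices are shifted by the number of nodes that came before, and a `batch` vector records which graph each node belongs to. Below is a plain-Python sketch of that bookkeeping, with an assumed helper name `collate` and toy graphs given as `(num_nodes, edge_index)` pairs. Node features, which are simply concatenated, are left out.

```python
def collate(graphs):
    """Merge (num_nodes, [sources, targets]) graphs into one disconnected graph."""
    sources, targets, batch = [], [], []
    offset = 0
    for graph_id, (num_nodes, (src, dst)) in enumerate(graphs):
        # Shift node ids so they stay unique inside the merged graph
        sources += [s + offset for s in src]
        targets += [t + offset for t in dst]
        # Every node remembers which graph it came from
        batch += [graph_id] * num_nodes
        offset += num_nodes
    return [sources, targets], batch

g0 = (2, [[0, 1], [1, 0]])               # one edge between 2 nodes
g1 = (3, [[0, 1, 1, 2], [1, 0, 2, 1]])   # path over 3 nodes
merged_edge_index, batch_vector = collate([g0, g1])
print(merged_edge_index)  # [[0, 1, 2, 3, 3, 4], [1, 0, 3, 2, 4, 3]]
print(batch_vector)       # [0, 0, 1, 1, 1]
```

Since no edge crosses between graphs, message passing on the merged graph gives the same result as running on each graph separately.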

In [45]:
# Please do not change the args
# (Despite the instruction above, epochs was reduced from 30 to 5 here to save time)
args = {
    'device': device,
    'num_layers': 5,
    'hidden_dim': 256,
    'dropout': 0.5,
    'lr': 0.001,
    'epochs': 5,
}
args

{'device': 'cuda',
 'num_layers': 5,
 'hidden_dim': 256,
 'dropout': 0.5,
 'lr': 0.001,
 'epochs': 5}

## Graph Prediction Model

Now we will implement our GCN Graph Prediction model!

We will reuse the existing GCN model to generate `node_embeddings` and use global pooling over the nodes to predict properties for the whole graph.

Note that the pooling functions needed for this part are already imported on the second line of the `GCN_Graph` cell.
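
Global mean pooling reduces the node embeddings of each graph in a batch to a single vector by averaging, using the `batch` vector to tell the graphs apart. Below is a plain-Python sketch of that reduction on hypothetical data. The name `global_mean_pool_sketch` is an assumption chosen to avoid confusion with the real `global_mean_pool`, and the sketch assumes every graph has at least one node.

```python
def global_mean_pool_sketch(node_embeddings, batch):
    """Average node embeddings per graph; batch[i] is the graph id of node i."""
    num_graphs = max(batch) + 1
    dim = len(node_embeddings[0])
    sums = [[0.0] * dim for _ in range(num_graphs)]
    counts = [0] * num_graphs
    for embedding, graph_id in zip(node_embeddings, batch):
        counts[graph_id] += 1
        for k, value in enumerate(embedding):
            sums[graph_id][k] += value
    return [[s / counts[g] for s in sums[g]] for g in range(num_graphs)]

# 5 nodes with 2-dimensional embeddings, split across 2 graphs (hypothetical data)
node_embeddings = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0], [3.0, 3.0], [6.0, 9.0]]
batch = [0, 0, 1, 1, 1]
pooled = global_mean_pool_sketch(node_embeddings, batch)
print(pooled)  # [[2.0, 3.0], [3.0, 4.0]]
```

The output has one row per graph, which is the shape the final linear layer expects.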

In [46]:
from ogb.graphproppred.mol_encoder import AtomEncoder
from torch_geometric.nn import global_add_pool, global_mean_pool  # pooling functions are imported here

### GCN to predict graph property
class GCN_Graph(torch.nn.Module):
    def __init__(self, hidden_dim, output_dim, num_layers, dropout):
        super(GCN_Graph, self).__init__()

        # Load encoders for Atoms in molecule graphs
        self.node_encoder = AtomEncoder(hidden_dim)

        # Node embedding model
        # Note that the input_dim and output_dim are set to hidden_dim
        self.gnn_node = GCN(hidden_dim, hidden_dim,
            hidden_dim, num_layers, dropout, return_embeds=True)

        self.pool = None

        ############# Your code here ############
        ## Note:
        ## 1. Initialize the self.pool to global mean pooling layer
        ## More information please refer to the documentation:
        ## https://pytorch-geometric.readthedocs.io/en/latest/modules/nn.html#global-pooling-layers
        ## (~1 line of code)
        self.pool = global_mean_pool
        #########################################

        # Output layer
        self.linear = torch.nn.Linear(hidden_dim, output_dim)


    def reset_parameters(self):
        self.gnn_node.reset_parameters()
        self.linear.reset_parameters()

    def forward(self, batched_data):
        # TODO: Implement this function that takes the input tensor batched_data,
        # returns a batched output tensor for each graph.
        x, edge_index, batch = batched_data.x, batched_data.edge_index, batched_data.batch
        embed = self.node_encoder(x)

        out = None

        ############# Your code here ############
        ## Note:
        ## 1. Construct node embeddings using existing GCN model
        ## 2. Use global pooling layer to construct features for the whole graph
        ## More information please refer to the documentation:
        ## https://pytorch-geometric.readthedocs.io/en/latest/modules/nn.html#global-pooling-layers
        ## 3. Use a linear layer to predict the graph property 
        ## (~3 lines of code)
        out = self.gnn_node(embed, edge_index)
        out = self.pool(out, batch)
        out = self.linear(out)
        #########################################

        return out

In [47]:
def train(model, device, data_loader, optimizer, loss_fn):
    # TODO: Implement this function that trains the model by
    # using the given optimizer and loss_fn.
    model.train()
    loss = 0

    for step, batch in enumerate(tqdm(data_loader, desc="Iteration")):
        batch = batch.to(device)

        if batch.x.shape[0] == 1 or batch.batch[-1] == 0:
            pass
        else:
            ## ignore nan targets (unlabeled) when computing training loss.
            is_labeled = batch.y == batch.y

            ############# Your code here ############
            ## Note:
            ## 1. Zero grad the optimizer
            ## 2. Feed the data into the model
            ## 3. Use `is_labeled` mask to filter output and labels
            ## 4. You might change the type of label
            ## 5. Feed the output and label to loss_fn
            ## (~3 lines of code)
            optimizer.zero_grad()
            out = model(batch)
            # Boolean-mask indexing flattens both tensors to 1-D
            train_out = out[is_labeled]
            # Labels are stored as integers, while the BCE-with-logits loss
            # used in this notebook expects float targets
            train_labels = batch.y[is_labeled].float()
            loss = loss_fn(train_out, train_labels)
            #########################################

            loss.backward()
            optimizer.step()

    return loss.item()
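
The `batch.y == batch.y` expression above relies on the fact that NaN is the only floating-point value that is not equal to itself, so the comparison is `False` exactly at unlabeled entries. A plain-Python sketch with hypothetical labels and outputs:

```python
nan = float("nan")

labels = [1.0, nan, 0.0, nan]     # hypothetical targets; NaN marks "unlabeled"
logits = [0.3, -1.2, 2.0, 0.7]    # hypothetical model outputs

# NaN is the only float that is not equal to itself
is_labeled = [y == y for y in labels]

kept_labels = [y for y, keep in zip(labels, is_labeled) if keep]
kept_logits = [z for z, keep in zip(logits, is_labeled) if keep]
print(is_labeled)    # [True, False, True, False]
print(kept_labels)   # [1.0, 0.0]
print(kept_logits)   # [0.3, 2.0]
```

Only the kept pairs contribute to the loss, so unlabeled entries produce no gradient.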

In [48]:
# The evaluation function
def eval(model, device, loader, evaluator):
    model.eval()
    y_true = []
    y_pred = []

    for step, batch in enumerate(tqdm(loader, desc="Iteration")):
        batch = batch.to(device)

        if batch.x.shape[0] == 1:
            pass
        else:
            with torch.no_grad():
                pred = model(batch)

            y_true.append(batch.y.view(pred.shape).detach().cpu())
            y_pred.append(pred.detach().cpu())

    y_true = torch.cat(y_true, dim = 0).numpy()
    y_pred = torch.cat(y_pred, dim = 0).numpy()

    input_dict = {"y_true": y_true, "y_pred": y_pred}

    return evaluator.eval(input_dict)

In [49]:
model = GCN_Graph(args['hidden_dim'],
            dataset.num_tasks, args['num_layers'],
            args['dropout']).to(device)
evaluator = Evaluator(name='ogbg-molhiv')

In [50]:
import copy

model.reset_parameters()

optimizer = torch.optim.Adam(model.parameters(), lr=args['lr'])
loss_fn = torch.nn.BCEWithLogitsLoss()

best_model = None
best_valid_acc = 0

for epoch in range(1, 1 + args["epochs"]):
    print('Training...')
    loss = train(model, device, train_loader, optimizer, loss_fn)

    print('Evaluating...')
    train_result = eval(model, device, train_loader, evaluator)
    val_result = eval(model, device, valid_loader, evaluator)
    test_result = eval(model, device, test_loader, evaluator)

    train_acc, valid_acc, test_acc = train_result[dataset.eval_metric], val_result[dataset.eval_metric], test_result[dataset.eval_metric]
    if valid_acc > best_valid_acc:
        best_valid_acc = valid_acc
        best_model = copy.deepcopy(model)
    print(f'Epoch: {epoch:02d}, '
          f'Loss: {loss:.4f}, '
          f'Train: {100 * train_acc:.2f}%, '
          f'Valid: {100 * valid_acc:.2f}% '
          f'Test: {100 * test_acc:.2f}%')

Training...


Iteration:   0%|          | 0/1029 [00:00<?, ?it/s]

Evaluating...


Iteration:   0%|          | 0/1029 [00:00<?, ?it/s]

Iteration:   0%|          | 0/129 [00:00<?, ?it/s]

Iteration:   0%|          | 0/129 [00:00<?, ?it/s]

Epoch: 01, Loss: 0.0147, Train: 71.19%, Valid: 74.73% Test: 67.07%
Training...


Iteration:   0%|          | 0/1029 [00:00<?, ?it/s]

Evaluating...


Iteration:   0%|          | 0/1029 [00:00<?, ?it/s]

Iteration:   0%|          | 0/129 [00:00<?, ?it/s]

Iteration:   0%|          | 0/129 [00:00<?, ?it/s]

Epoch: 02, Loss: 0.8991, Train: 75.24%, Valid: 72.40% Test: 66.62%
Training...


Iteration:   0%|          | 0/1029 [00:00<?, ?it/s]

Evaluating...


Iteration:   0%|          | 0/1029 [00:00<?, ?it/s]

Iteration:   0%|          | 0/129 [00:00<?, ?it/s]

Iteration:   0%|          | 0/129 [00:00<?, ?it/s]

Epoch: 03, Loss: 0.4985, Train: 76.54%, Valid: 73.64% Test: 73.19%
Training...


Iteration:   0%|          | 0/1029 [00:00<?, ?it/s]

Evaluating...


Iteration:   0%|          | 0/1029 [00:00<?, ?it/s]

Iteration:   0%|          | 0/129 [00:00<?, ?it/s]

Iteration:   0%|          | 0/129 [00:00<?, ?it/s]

Epoch: 04, Loss: 0.5842, Train: 75.62%, Valid: 73.09% Test: 72.26%
Training...


Iteration:   0%|          | 0/1029 [00:00<?, ?it/s]

Evaluating...


Iteration:   0%|          | 0/1029 [00:00<?, ?it/s]

Iteration:   0%|          | 0/129 [00:00<?, ?it/s]

Iteration:   0%|          | 0/129 [00:00<?, ?it/s]

Epoch: 05, Loss: 0.0368, Train: 77.06%, Valid: 70.47% Test: 69.94%


In [51]:
train_acc = eval(best_model, device, train_loader, evaluator)[dataset.eval_metric]
valid_acc = eval(best_model, device, valid_loader, evaluator)[dataset.eval_metric]
test_acc = eval(best_model, device, test_loader, evaluator)[dataset.eval_metric]

print(f'Best model: '
      f'Train: {100 * train_acc:.2f}%, '
      f'Valid: {100 * valid_acc:.2f}% '
      f'Test: {100 * test_acc:.2f}%')

Iteration:   0%|          | 0/1029 [00:00<?, ?it/s]

Iteration:   0%|          | 0/129 [00:00<?, ?it/s]

Iteration:   0%|          | 0/129 [00:00<?, ?it/s]

Best model: Train: 71.19%, Valid: 74.73% Test: 67.07%


## Question 6: What are your `best_model` validation and test ROC-AUC score? Please report them on Gradescope. For example, for an ROC-AUC score such as 50.01%, just report 50.01 and please don't include the percent sign. (20 points)
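
Although the training loop stores the metric in variables named `train_acc`, `valid_acc`, and `test_acc`, the quantity reported for this task is ROC-AUC, not accuracy. ROC-AUC equals the probability that a randomly chosen positive example scores higher than a randomly chosen negative one, so it depends only on the ranking of the scores. Below is a plain-Python sketch of that pairwise definition, with an assumed helper name `roc_auc` and hypothetical data. Its cost grows with the number of positive-negative pairs, so it is meant for illustration only.

```python
def roc_auc(labels, scores):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))

labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]   # hypothetical predicted scores
auc = roc_auc(labels, scores)
print(auc)  # 0.75
```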

## Question 7 (Optional): Experiment with two other global pooling layers, besides mean pooling, in PyTorch Geometric.

In [52]:
# TODO: left for later, once the rest is done
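
As a starting point for this optional question, the plain-Python sketch below contrasts sum, max, and mean pooling on hypothetical one-dimensional node embeddings. The helper `pool_per_graph` is an assumption for illustration. In PyG the corresponding functions are `global_add_pool` and `global_max_pool`, alongside the `global_mean_pool` used above.

```python
def pool_per_graph(values, batch, reduce):
    """Apply `reduce` (e.g. sum, max) to the scalar node values of each graph."""
    groups = {}
    for value, graph_id in zip(values, batch):
        groups.setdefault(graph_id, []).append(value)
    return [reduce(groups[g]) for g in sorted(groups)]

values = [1.0, 3.0, 2.0, 2.0, 2.0]   # hypothetical 1-dimensional node embeddings
batch = [0, 0, 1, 1, 1]

sum_pooled = pool_per_graph(values, batch, sum)
max_pooled = pool_per_graph(values, batch, max)
mean_pooled = pool_per_graph(values, batch, lambda v: sum(v) / len(v))
print(sum_pooled)   # [4.0, 6.0]
print(max_pooled)   # [3.0, 2.0]
print(mean_pooled)  # [2.0, 2.0]
```

In this toy case the two graphs are indistinguishable under mean pooling but differ under sum and max pooling, which is one reason the choice of pooling can affect the results.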

# Submission

In order to get credit, you must go submit your answers on Gradescope.

Also, you need to submit the `ipynb` file of Colab 2, by clicking `File` and `Download .ipynb`. Please make sure that your output of each cell is available in your `ipynb` file.