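First, we will see how PyTorch Geometric stores graphs as PyTorch tensors.

Next, we will load and inspect one of the Open Graph Benchmark (OGB) datasets using the ogb package. OGB is a collection of realistic, large-scale, and diverse benchmark datasets for machine learning on graphs. The ogb package provides not only a data loader for each dataset but also a model evaluator.

Finally, we will build our own GNNs with PyTorch Geometric, then train and evaluate them on OGB node property prediction and graph property prediction tasks.

Note: Run all cells in each section in order, so that intermediate variables and packages carry over to the next cell. This lab takes about two hours to complete.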

# Environment setup

In [1]:
import os

import torch
print("PyTorch has version {}".format(torch.__version__))

PyTorch has version 2.5.1+cu124


Install PyG's dependencies, making sure they match the installed version of torch. If you run into problems, consult [PyG's installation page](https://www.google.com/url?q=https%3A%2F%2Fpytorch-geometric.readthedocs.io%2Fen%2Flatest%2Fnotes%2Finstallation.html).

In [2]:
# Install torch geometric
import os
import torch
if 'IS_GRADESCOPE_ENV' not in os.environ:
  torch_version = str(torch.__version__)
  scatter_src = f"https://pytorch-geometric.com/whl/torch-{torch_version}.html"
  sparse_src = f"https://pytorch-geometric.com/whl/torch-{torch_version}.html"
  !pip install torch-scatter -f $scatter_src
  !pip install torch-sparse -f $sparse_src
  !pip install torch-geometric
  !pip install ogb

Looking in links: https://pytorch-geometric.com/whl/torch-2.5.1+cu124.html
Collecting torch-scatter
  Downloading https://data.pyg.org/whl/torch-2.5.0%2Bcu124/torch_scatter-2.1.2%2Bpt25cu124-cp310-cp310-linux_x86_64.whl (10.8 MB)
Installing collected packages: torch-scatter
Successfully installed torch-scatter-2.1.2+pt25cu124
Looking in links: https://pytorch-geometric.com/whl/torch-2.5.1+cu124.html
Collecting torch-sparse
  Downloading https://data.pyg.org/whl/torch-2.5.0%2Bcu124/torch_sparse-0.6.18%2Bpt25cu124-cp310-cp310-linux_x86_64.whl (5.1 MB)
Installing collected packages: torch-sparse
Successfully installed torch-sparse-0.6.18+pt25cu124
Collecting torch-geometric
  Downloading torch_geometric-2.

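# 1) PyG (Datasets and Data)

PyTorch Geometric has two modules for storing graphs and/or converting them to tensor format.
One is `torch_geometric.datasets`, which contains a variety of common graph datasets;
the other is `torch_geometric.data`, which provides the data handling needed to represent graphs as PyTorch tensors.

In this section, we will learn how to use `torch_geometric.datasets` and `torch_geometric.data` together.

## PyG Datasets

The `torch_geometric.datasets` module contains many graph datasets. We will use one of them to explore how it works.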

In [3]:
from torch_geometric.datasets import TUDataset

if 'IS_GRADESCOPE_ENV' not in os.environ:
  root = './enzymes'
  name = 'ENZYMES'

  # The ENZYMES dataset
  pyg_dataset = TUDataset(root, name)

  # It contains 600 graphs
  print(pyg_dataset)

Downloading https://www.chrsmrrs.com/graphkerneldatasets/ENZYMES.zip
Processing...


ENZYMES(600)


Done!


### Question 1: How many classes and features does the ENZYMES dataset have?

In [4]:
def get_num_classes(pyg_dataset):
  # TODO: Implement a function that takes a PyG dataset object
  # and returns the number of classes in that dataset.

  num_classes = pyg_dataset.num_classes

  ############# Your code here ############
  ## (~1 line of code)
  ## Note:
  ## 1. Autocomplete may be helpful.

  #########################################

  return num_classes

def get_num_features(pyg_dataset):
  # TODO: Implement a function that takes a PyG dataset object
  # and returns the number of features in that dataset.

  num_features = pyg_dataset.num_features

  ############# Your code here ############
  ## (~1 line of code)
  ## Note:
  ## 1. Autocomplete may be helpful.

  #########################################

  return num_features

if 'IS_GRADESCOPE_ENV' not in os.environ:
  num_classes = get_num_classes(pyg_dataset)
  num_features = get_num_features(pyg_dataset)
  print("{} dataset has {} classes".format(name, num_classes))
  print("{} dataset has {} features".format(name, num_features))


ENZYMES dataset has 6 classes
ENZYMES dataset has 3 features


## PyG Data

Each PyG dataset stores a list of `torch_geometric.data.Data` objects, where each `torch_geometric.data.Data` object represents one graph.

We can get a `Data` object by indexing into the dataset.
For more on what a `Data` object contains, see the [official documentation](https://pytorch-geometric.readthedocs.io/en/latest/modules/data.html#torch_geometric.data.Data).
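
To make the storage layout concrete, here is a minimal plain-Python sketch of the two fields that appear in outputs such as `Data(edge_index=[2, 168], x=[37, 3], y=[1])`: node features `x` with shape `[num_nodes, num_features]`, and `edge_index` with shape `[2, num_edges]` in COO format. It deliberately avoids PyG; the helper `to_edge_index` and the toy triangle graph are illustrative assumptions, not part of the library.

```python
def to_edge_index(undirected_edges):
    """Build a COO-style edge_index with two rows: sources and targets.

    Each undirected edge (u, v) is stored in both directions, which is the
    convention this lab assumes for undirected graphs.
    """
    sources, targets = [], []
    for u, v in undirected_edges:
        sources += [u, v]
        targets += [v, u]
    return [sources, targets]


# Hypothetical toy graph: a triangle on nodes 0, 1, 2 with 3 features per node.
x = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
edge_index = to_edge_index([(0, 1), (1, 2), (0, 2)])

print(len(x), len(x[0]))                    # 3 3  -> x has shape [3, 3]
print(len(edge_index), len(edge_index[0]))  # 2 6  -> edge_index has shape [2, 6]
```

In PyG both fields are tensors rather than nested lists, but the shapes follow the same pattern.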

### Question 2: What is the label of the graph with index 100 in the ENZYMES dataset?

In [5]:
def get_graph_class(pyg_dataset, idx):
  # TODO: Implement a function that takes a PyG dataset object and
  # the index of a graph in the dataset, and returns that graph's class/label (an integer).

  label = pyg_dataset[idx].y.item()

  ############# Your code here ############
  ## (~1 line of code)

  #########################################

  return label

# Here pyg_dataset is a dataset for graph classification
if 'IS_GRADESCOPE_ENV' not in os.environ:
  graph_0 = pyg_dataset[0]
  print(graph_0)
  idx = 100
  label = get_graph_class(pyg_dataset, idx)
  print('Graph with index {} has label {}'.format(idx, label))


Data(edge_index=[2, 168], x=[37, 3], y=[1])
Graph with index 100 has label 4


### Question 3: How many edges does the graph with index 200 have?

In [8]:
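def get_graph_num_edges(pyg_dataset, idx):
  # TODO: Implement a function that takes a PyG dataset object and
  # the index of a graph in the dataset, and returns the number of edges in that graph (an integer).
  # If the graph is undirected, edges should not be double counted.
  # For example, in an undirected graph G, if two nodes v and u are connected by an edge,
  # that edge should be counted only once.

  data = pyg_dataset[idx]
  edge_index = data.edge_index
  # Each undirected edge is stored once per direction, so halve the column count
  num_edges = edge_index.size(1) // 2

  ############# Your code here ############
  ## Note:
  ## 1. You cannot return data.num_edges directly
  ## 2. We assume the graph is undirected
  ## 3. Take a look at the functions PyG datasets provide
  ## (~4 lines of code)

  #########################################

  return num_edges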

if 'IS_GRADESCOPE_ENV' not in os.environ:
  idx = 200
  num_edges = get_graph_num_edges(pyg_dataset, idx)
  print('Graph with index {} has {} edges'.format(idx, num_edges))


Graph with index 200 has 53 edges
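
The solution above halves the number of columns in `edge_index`, which is only correct if every undirected edge really is stored in both directions. A more defensive alternative is to deduplicate the node pairs explicitly. The sketch below shows that idea in plain Python on a hypothetical toy graph; `count_undirected_edges` is an illustrative helper, not a PyG function.

```python
def count_undirected_edges(edge_index):
    """Count each undirected edge once, whichever direction(s) it is stored in."""
    sources, targets = edge_index
    # Order every pair as (smaller, larger) so (u, v) and (v, u) collapse together.
    unique = {(min(u, v), max(u, v)) for u, v in zip(sources, targets)}
    return len(unique)


# Hypothetical toy graph: a triangle stored in both directions (6 entries).
edge_index = [[0, 1, 1, 2, 0, 2],
              [1, 0, 2, 1, 2, 0]]

print(count_undirected_edges(edge_index))  # 3
print(len(edge_index[0]) // 2)             # 3, the shortcut used in the cell above
```

The two counts agree whenever the graph is stored symmetrically and has no self-loops; they can differ otherwise, for example if an edge appears in only one direction.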


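# 2) Open Graph Benchmark (OGB)

The **Open Graph Benchmark (OGB)** is a collection of realistic, large-scale, and diverse benchmark datasets for machine learning on graphs.

These datasets can be **automatically downloaded, processed, and split** by the OGB Data Loader.

Model performance can then be evaluated in a unified manner using the OGB Evaluator.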
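
As used later in this lab, an evaluator takes a dictionary with `y_true` and `y_pred` and returns a dictionary keyed by the metric name (`'acc'` for `ogbn-arxiv`). The following plain-Python stand-in sketches that interface for accuracy. It is an assumption-laden illustration of the idea, not the OGB implementation, and `accuracy_eval` is a hypothetical name.

```python
def accuracy_eval(input_dict):
    """Return {'acc': fraction of predictions that match the true labels}."""
    y_true, y_pred = input_dict["y_true"], input_dict["y_pred"]
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return {"acc": correct / len(y_true)}


# Hypothetical labels for four nodes; three of the four predictions are right.
result = accuracy_eval({"y_true": [3, 1, 2, 2], "y_pred": [3, 0, 2, 2]})
print(result["acc"])  # 0.75
```

Because every model is scored through the same function on the same split, results from different models stay comparable.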

If the automatic download is slow, the datasets can also be downloaded from NJU Box: https://box.nju.edu.cn/d/5f1c0015382643c9be0d/

## Datasets and Data

OGB also supports PyG's dataset and data classes. Here we take a look at the `ogbn-arxiv` dataset.

In [9]:
import torch_geometric.transforms as T
from ogb.nodeproppred import PygNodePropPredDataset



if 'IS_GRADESCOPE_ENV' not in os.environ:
  dataset_name = 'ogbn-arxiv'
  # Load the dataset and convert the graph's edges to a sparse adjacency tensor
  dataset = PygNodePropPredDataset(name=dataset_name,
                                  transform=T.ToSparseTensor())
  print('The {} dataset has {} graph'.format(dataset_name, len(dataset)))

  # Extract the graph
  data = dataset[0]
  print(data)

Downloading http://snap.stanford.edu/ogb/data/nodeproppred/arxiv.zip


Downloaded 0.08 GB: 100%|██████████| 81/81 [00:16<00:00,  4.85it/s]


Extracting dataset/arxiv.zip


Processing...


Loading necessary files...
This might take a while.
Processing graphs...


100%|██████████| 1/1 [00:00<00:00, 15363.75it/s]


Converting graphs into PyG objects...


100%|██████████| 1/1 [00:00<00:00, 46.43it/s]

Saving...



Done!


The ogbn-arxiv dataset has 1 graph
Data(num_nodes=169343, x=[169343, 128], node_year=[169343, 1], y=[169343, 1], adj_t=[169343, 169343, nnz=1166243])


### Question 4: How many features does the ogbn-arxiv graph have?

In [10]:
def graph_num_features(data):
  # TODO: Implement a function that takes a PyG data object
  # and returns the number of features of the graph (an integer).

  num_features = data.x.size(1)

  ############# Your code here ############
  ## (~1 line of code)

  #########################################

  return num_features

if 'IS_GRADESCOPE_ENV' not in os.environ:
  num_features = graph_num_features(data)
  print('The graph has {} features'.format(num_features))


The graph has 128 features


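# 3) GNN: Node Property Prediction

In this section, we will build our first graph neural network with PyTorch Geometric and then apply it to the task of **node property prediction (node classification)**.

Specifically, we will use the **GCN (graph convolutional network)** as the foundation of our graph neural network ([Kipf et al., 2017](https://arxiv.org/abs/1609.02907)).
To do so, we will use PyG's built-in `GCNConv` layer.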
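
For intuition, the core of a GCN layer in Kipf and Welling's formulation is the normalized neighborhood aggregation $D^{-1/2}(A + I)D^{-1/2}H$, where $D$ is the degree matrix of $A + I$. The sketch below computes just that step in plain Python on a hypothetical three-node path graph. It leaves out the learned weight matrix and bias that `GCNConv` also applies, so treat it as an illustration of the propagation rule rather than a reimplementation of the layer.

```python
import math


def gcn_aggregate(num_nodes, undirected_edges, features):
    """One GCN propagation step without the learned weight matrix.

    Computes D^{-1/2} (A + I) D^{-1/2} H, where D holds the degrees of A + I.
    """
    # Neighbor sets that include a self-loop for every node.
    neighbors = {i: {i} for i in range(num_nodes)}
    for u, v in undirected_edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    degree = {i: len(neighbors[i]) for i in range(num_nodes)}

    dim = len(features[0])
    out = []
    for i in range(num_nodes):
        row = [0.0] * dim
        for j in neighbors[i]:
            # Symmetric normalization coefficient for the pair (i, j).
            coeff = 1.0 / math.sqrt(degree[i] * degree[j])
            for k in range(dim):
                row[k] += coeff * features[j][k]
        out.append(row)
    return out


# Hypothetical path graph 0 - 1 - 2 with one scalar feature per node.
h = gcn_aggregate(3, [(0, 1), (1, 2)], [[1.0], [2.0], [3.0]])
print(h)
```

Each output row is a degree-weighted mix of the node's own feature and its neighbors' features, which is what lets stacked layers spread information across the graph.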

## Setup

In [11]:
import torch
import pandas as pd
import torch.nn.functional as F
print(torch.__version__)

# Use PyG's built-in GCNConv
from torch_geometric.nn import GCNConv

import torch_geometric.transforms as T
from ogb.nodeproppred import PygNodePropPredDataset, Evaluator

2.5.1+cu124


## Load and preprocess the data

In [12]:
if 'IS_GRADESCOPE_ENV' not in os.environ:
  dataset_name = 'ogbn-arxiv'
  dataset = PygNodePropPredDataset(name=dataset_name,
                                  transform=T.Compose([T.ToUndirected(),T.ToSparseTensor()]))
  data = dataset[0]

  device = 'cuda' if torch.cuda.is_available() else 'cpu'

  # If you are using a GPU, the device should be cuda
  print('Device: {}'.format(device))

  data = data.to(device)
  split_idx = dataset.get_idx_split()
  train_idx = split_idx['train'].to(device)



Device: cuda


## GCN Model

Now let's implement our GCN model!

Implement the `forward` function according to the architecture shown in the figure below:
![GCN model architecture](https://drive.google.com/uc?id=128AuYAXNXGg7PIhJJ7e420DoPWKb-RtL)

In [14]:
class GCN(torch.nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim, num_layers,
                 dropout, return_embeds=False):
        # TODO: Implement a function that initializes self.convs, self.bns, and self.softmax.

        super(GCN, self).__init__()

        # A list of GCNConv layers
        self.convs = torch.nn.ModuleList()

        # A list of 1D batch normalization layers (BatchNorm1d)
        self.bns = torch.nn.ModuleList()

        # The log softmax layer
        self.softmax = torch.nn.LogSoftmax(dim=1)

        ############# Your code here ############
        ## Note:
        ## 1. self.convs and self.bns should use torch.nn.ModuleList
        ## 2. self.convs should contain num_layers GCNConv layers
        ## 3. self.bns should contain num_layers - 1 BatchNorm1d layers
        ## 4. self.softmax should use torch.nn.LogSoftmax
        ## 5. The parameters to set for GCNConv include 'in_channels' and 'out_channels'.
        ##    For more information, see the documentation:
        ##    https://pytorch-geometric.readthedocs.io/en/latest/modules/nn.html#torch_geometric.nn.conv.GCNConv
        ## 6. The only parameter to set for BatchNorm1d is 'num_features'.
        ##    For more information, see the documentation:
        ##    https://pytorch.org/docs/stable/generated/torch.nn.BatchNorm1d.html
        ## (~10 lines of code)

        #########################################
        # First layer: input_dim -> hidden_dim
        self.convs.append(GCNConv(input_dim, hidden_dim))

        # Middle layers: hidden_dim -> hidden_dim
        for _ in range(num_layers - 2):
            self.convs.append(GCNConv(hidden_dim, hidden_dim))

        # Last layer: hidden_dim -> output_dim
        self.convs.append(GCNConv(hidden_dim, output_dim))

        # One batch norm after every conv layer except the last
        for _ in range(num_layers - 1):
            self.bns.append(torch.nn.BatchNorm1d(hidden_dim))

        # Probability of an element being zeroed (dropout probability)
        self.dropout = dropout

        # Whether to skip the classification layer and return node embeddings
        self.return_embeds = return_embeds

    def reset_parameters(self):
        for conv in self.convs:
            conv.reset_parameters()
        for bn in self.bns:
            bn.reset_parameters()

    def forward(self, x, adj_t):
        # TODO: Implement a function that takes the feature tensor x and
        # the adjacency tensor adj_t, and returns the output tensor as shown in the figure.

        ############# Your code here ############
        ## Note:
        ## 1. Construct the network as shown in the figure
        ## 2. torch.nn.functional.relu and torch.nn.functional.dropout are useful
        ##    Documentation: https://pytorch.org/docs/stable/nn.functional.html
        ## 3. Don't forget to set F.dropout's training argument to self.training
        ## 4. If return_embeds is True, skip the last softmax layer
        ## (~7 lines of code)

        #########################################

        for i, conv in enumerate(self.convs[:-1]):  # every layer except the last
            x = conv(x, adj_t)
            x = self.bns[i](x)
            x = F.relu(x)
            x = F.dropout(x, p=self.dropout, training=self.training)
        x = self.convs[-1](x, adj_t)  # last layer

        if self.return_embeds:
            return x  # skip the softmax and return the embeddings directly
        else:
            return self.softmax(x)
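
The layer layout built in `__init__` can be summarized with a small helper that lists the input and output sizes of each convolution and the size of each batch norm. This is a plain-Python sketch for checking dimensions by hand. `gcn_layer_plan` is a hypothetical name, and the value 40 for `output_dim` is an assumed class count used only for illustration.

```python
def gcn_layer_plan(input_dim, hidden_dim, output_dim, num_layers):
    """Return the (in, out) size of each conv layer and the batch-norm sizes."""
    if num_layers < 2:
        # The class above always creates a first and a last layer.
        raise ValueError("this layout assumes at least two conv layers")
    dims = [input_dim] + [hidden_dim] * (num_layers - 1) + [output_dim]
    conv_shapes = list(zip(dims[:-1], dims[1:]))
    bn_sizes = [hidden_dim] * (num_layers - 1)
    return conv_shapes, bn_sizes


# 128 input features and 3 layers with hidden size 256, as in this lab's settings.
convs, bns = gcn_layer_plan(input_dim=128, hidden_dim=256, output_dim=40, num_layers=3)
print(convs)  # [(128, 256), (256, 256), (256, 40)]
print(bns)    # [256, 256]
```

Every batch norm has size `hidden_dim` because normalization is applied only after the hidden layers, never after the final convolution.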


In [23]:
def train(model, data, train_idx, optimizer, loss_fn):
    # TODO: Implement a function that trains the model using the given optimizer and loss function.
    model.train()
    loss = 0

    ############# Your code here ############
    ## Note:
    ## 1. Zero grad the optimizer
    ## 2. Feed the data into the model
    ## 3. Slice the model output and labels with train_idx
    ## 4. Feed the sliced output and labels to loss_fn
    ## (~4 lines of code)

    #########################################
    optimizer.zero_grad()
    out = model(data.x, data.adj_t)
    # data.y has shape [num_nodes, 1]; squeeze it to the 1D targets that nll_loss expects
    loss = loss_fn(out[train_idx], data.y[train_idx].squeeze())

    loss.backward()
    optimizer.step()

    return loss.item()
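
The model ends in a `LogSoftmax` layer and the training loop in this section pairs it with `F.nll_loss`; together the two amount to the usual cross-entropy loss on raw logits. The plain-Python sketch below spells out both pieces on hypothetical logits. The function names mirror the PyTorch ones for readability, but these are simplified stand-ins, not the library implementations.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_sum for z in logits]


def nll_loss(log_probs_batch, targets):
    """Mean negative log-likelihood of the target classes."""
    picked = [row[t] for row, t in zip(log_probs_batch, targets)]
    return -sum(picked) / len(picked)


# Hypothetical logits for two nodes and three classes.
logits = [[2.0, 0.5, -1.0], [0.0, 0.0, 0.0]]
log_probs = [log_softmax(row) for row in logits]
loss = nll_loss(log_probs, targets=[0, 2])
print(round(loss, 4))
```

The second node has uniform logits, so each of its log-probabilities equals `-log(3)`, which makes a convenient sanity check.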


In [25]:
# The test function
@torch.no_grad()
def test(model, data, split_idx, evaluator, save_model_results=False):
    # TODO: Implement a function that tests the model using the given split_idx and evaluator.
    model.eval()

    # The model's output on all of the data
    out = model(data.x, data.adj_t)

    ############# Your code here ############
    ## (~1 line of code)
    ## Note:
    ## 1. No index slicing here

    #########################################

    y_pred = out.argmax(dim=-1, keepdim=True)

    train_acc = evaluator.eval({
        'y_true': data.y[split_idx['train']],
        'y_pred': y_pred[split_idx['train']],
    })['acc']
    valid_acc = evaluator.eval({
        'y_true': data.y[split_idx['valid']],
        'y_pred': y_pred[split_idx['valid']],
    })['acc']
    test_acc = evaluator.eval({
        'y_true': data.y[split_idx['test']],
        'y_pred': y_pred[split_idx['test']],
    })['acc']

    if save_model_results:
      print ("Saving Model Predictions")

      data = {}
      data['y_pred'] = y_pred.view(-1).cpu().detach().numpy()

      df = pd.DataFrame(data=data)
      # Save locally as a CSV file
      df.to_csv('ogbn-arxiv_node.csv', sep=',', index=False)

    return train_acc, valid_acc, test_acc


In [17]:
# Please do not change the args
if 'IS_GRADESCOPE_ENV' not in os.environ:
  args = {
      'device': device,
      'num_layers': 3,
      'hidden_dim': 256,
      'dropout': 0.5,
      'lr': 0.01,
      'epochs': 100,
  }
  args

In [18]:
if 'IS_GRADESCOPE_ENV' not in os.environ:
  model = GCN(data.num_features, args['hidden_dim'],
              dataset.num_classes, args['num_layers'],
              args['dropout']).to(device)
  evaluator = Evaluator(name='ogbn-arxiv')

In [26]:
# Please do not change the args
# Training should take less than 10 minutes on a GPU
import copy
if 'IS_GRADESCOPE_ENV' not in os.environ:
  # reset the parameters to initial random value
  model.reset_parameters()

  optimizer = torch.optim.Adam(model.parameters(), lr=args['lr'])
  loss_fn = F.nll_loss

  best_model = None
  best_valid_acc = 0

  for epoch in range(1, 1 + args["epochs"]):
    loss = train(model, data, train_idx, optimizer, loss_fn)
    result = test(model, data, split_idx, evaluator)
    train_acc, valid_acc, test_acc = result
    if valid_acc > best_valid_acc:
        best_valid_acc = valid_acc
        best_model = copy.deepcopy(model)
    print(f'Epoch: {epoch:02d}, '
          f'Loss: {loss:.4f}, '
          f'Train: {100 * train_acc:.2f}%, '
          f'Valid: {100 * valid_acc:.2f}% '
          f'Test: {100 * test_acc:.2f}%')

Epoch: 01, Loss: 4.2296, Train: 28.14%, Valid: 30.47% Test: 27.42%
Epoch: 02, Loss: 2.3289, Train: 24.69%, Valid: 21.80% Test: 26.87%
Epoch: 03, Loss: 1.9262, Train: 23.50%, Valid: 19.36% Test: 22.94%
Epoch: 04, Loss: 1.7553, Train: 21.77%, Valid: 14.88% Test: 14.41%
Epoch: 05, Loss: 1.6714, Train: 26.11%, Valid: 24.49% Test: 24.56%
Epoch: 06, Loss: 1.6036, Train: 32.70%, Valid: 31.54% Test: 31.64%
Epoch: 07, Loss: 1.5296, Train: 36.31%, Valid: 30.14% Test: 32.29%
Epoch: 08, Loss: 1.4685, Train: 37.81%, Valid: 30.34% Test: 33.16%
Epoch: 09, Loss: 1.4272, Train: 36.85%, Valid: 29.42% Test: 32.82%
Epoch: 10, Loss: 1.3845, Train: 35.69%, Valid: 28.25% Test: 32.08%
Epoch: 11, Loss: 1.3520, Train: 34.90%, Valid: 27.85% Test: 31.59%
Epoch: 12, Loss: 1.3241, Train: 34.72%, Valid: 27.33% Test: 31.01%
Epoch: 13, Loss: 1.2998, Train: 35.53%, Valid: 27.66% Test: 31.15%
Epoch: 14, Loss: 1.2820, Train: 37.61%, Valid: 29.60% Test: 33.58%
Epoch: 15, Loss: 1.2600, Train: 40.63%, Valid: 32.72% Test: 37

### Question 5: What are your **best model**'s validation and test accuracies?

Run the cell below to see your best model's results
and to save its predictions to a file named `ogbn-arxiv_node.csv`.

In [27]:
if 'IS_GRADESCOPE_ENV' not in os.environ:
  best_result = test(best_model, data, split_idx, evaluator, save_model_results=True)
  train_acc, valid_acc, test_acc = best_result
  print(f'Best model: '
        f'Train: {100 * train_acc:.2f}%, '
        f'Valid: {100 * valid_acc:.2f}% '
        f'Test: {100 * test_acc:.2f}%')

Saving Model Predictions
Best model: Train: 73.75%, Valid: 71.81% Test: 70.71%


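# 4) GNN: Graph Property Prediction

In this section, we will build a graph neural network for graph property prediction.

## Load and preprocess the dataset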

In [28]:
from ogb.graphproppred import PygGraphPropPredDataset, Evaluator
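# In PyG 2.x, DataLoader lives in torch_geometric.loader (the torch_geometric.data alias is deprecated)
from torch_geometric.loader import DataLoader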
from tqdm import tqdm

if 'IS_GRADESCOPE_ENV' not in os.environ:
  # Load the dataset
  dataset = PygGraphPropPredDataset(name='ogbg-molhiv')

  device = 'cuda' if torch.cuda.is_available() else 'cpu'
  print('Device: {}'.format(device))

  split_idx = dataset.get_idx_split()

  # Check the task type
  print('Task type: {}'.format(dataset.task_type))

Downloading http://snap.stanford.edu/ogb/data/graphproppred/csv_mol_download/hiv.zip


Downloaded 0.00 GB: 100%|██████████| 3/3 [00:03<00:00,  1.15s/it]
Processing...


Extracting dataset/hiv.zip
Loading necessary files...
This might take a while.
Processing graphs...


100%|██████████| 41127/41127 [00:00<00:00, 94095.28it/s] 


Converting graphs into PyG objects...


100%|██████████| 41127/41127 [00:05<00:00, 7479.53it/s] 


Saving...
Device: cuda
Task type: binary classification


Done!


In [29]:
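# Load the dataset splits into their corresponding dataloaders
# We will train the graph classification task on batches of 32 graphs
# and shuffle the order of the graphs in the training set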
if 'IS_GRADESCOPE_ENV' not in os.environ:
  train_loader = DataLoader(dataset[split_idx["train"]], batch_size=32, shuffle=True, num_workers=0)
  valid_loader = DataLoader(dataset[split_idx["valid"]], batch_size=32, shuffle=False, num_workers=0)
  test_loader = DataLoader(dataset[split_idx["test"]], batch_size=32, shuffle=False, num_workers=0)



In [30]:
if 'IS_GRADESCOPE_ENV' not in os.environ:
  # Please do not change the args
  args = {
      'device': device,
      'num_layers': 5,
      'hidden_dim': 256,
      'dropout': 0.5,
      'lr': 0.001,
      'epochs': 30,
  }
  args

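## Graph Prediction Model

### Graph Mini-Batching

Before diving into the actual model, we introduce the concept of mini-batching with graphs. To process a mini-batch of graphs in parallel,
PyG combines the graphs into a single **disconnected graph** data object (`torch_geometric.data.Batch`).

`torch_geometric.data.Batch` inherits from `torch_geometric.data.Data` (introduced earlier)
and additionally contains an attribute called `batch`.

The `batch` attribute is a vector that maps each node to the index of the graph it belongs to within the mini-batch. For example:
<code>batch = [0, ..., 0, 1, ..., 1, ..., n - 2, n - 1, ..., n - 1]</code>

This attribute is crucial for knowing which graph each node belongs to.
For example, it can be used to average the node embeddings of each graph separately to obtain graph-level embeddings.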
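
The averaging described above can be sketched in a few lines of plain Python. The `mean_pool` helper below is a hypothetical stand-in for what a global mean pooling layer computes from node embeddings and the `batch` vector. It assumes every graph index from 0 to `max(batch)` owns at least one node.

```python
def mean_pool(node_embeddings, batch):
    """Average node embeddings per graph; batch[i] is the graph index of node i."""
    num_graphs = max(batch) + 1
    dim = len(node_embeddings[0])
    sums = [[0.0] * dim for _ in range(num_graphs)]
    counts = [0] * num_graphs
    for emb, g in zip(node_embeddings, batch):
        counts[g] += 1
        for k in range(dim):
            sums[g][k] += emb[k]
    return [[s / counts[g] for s in sums[g]] for g in range(num_graphs)]


# Hypothetical mini-batch: graph 0 has two nodes, graph 1 has three.
node_embeddings = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0], [3.0, 3.0], [6.0, 0.0]]
batch = [0, 0, 1, 1, 1]
print(mean_pool(node_embeddings, batch))  # [[2.0, 3.0], [3.0, 1.0]]
```

The output has one row per graph, regardless of how many nodes each graph contributed.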

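### Implementation

Now we have all of the tools we need to implement a GCN graph prediction model!

We will reuse the existing GCN model to generate **node embeddings** (`node_embeddings`),
then apply **global pooling** over the nodes to obtain a **graph-level embedding** for each graph,
which is used to predict that graph's properties.

Remember that the `batch` attribute is essential for performing global pooling within a mini-batch.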

In [31]:
from ogb.graphproppred.mol_encoder import AtomEncoder
from torch_geometric.nn import global_add_pool, global_mean_pool

### GCN for predicting graph properties
class GCN_Graph(torch.nn.Module):
    def __init__(self, hidden_dim, output_dim, num_layers, dropout):
        super(GCN_Graph, self).__init__()

        # Load the encoder for the atoms in molecule graphs
        self.node_encoder = AtomEncoder(hidden_dim)

        # The node embedding model
        # Note: both the input and output dimensions are set to hidden_dim
        self.gnn_node = GCN(hidden_dim, hidden_dim,
            hidden_dim, num_layers, dropout, return_embeds=True)

        self.pool = global_mean_pool

        ############# Your code here ############
        ## Note:
        ## 1. Initialize self.pool as a global mean pooling layer
        ##    For more information, see the documentation:
        ##    https://pytorch-geometric.readthedocs.io/en/latest/modules/nn.html#global-pooling-layers

        #########################################

        # The output layer
        self.linear = torch.nn.Linear(hidden_dim, output_dim)


    def reset_parameters(self):
      self.gnn_node.reset_parameters()
      self.linear.reset_parameters()

    def forward(self, batched_data):
        # TODO: Implement a function that takes a mini-batch of graphs
        # (torch_geometric.data.Batch) and returns the predicted property of each graph.
        #
        # Note: since we are predicting graph-level properties,
        # the output has one row per graph in the mini-batch.

        # Encode atom features, embed the nodes, pool per graph, then apply the output layer
        x = self.node_encoder(batched_data.x)
        node_embeddings = self.gnn_node(x, batched_data.edge_index)
        graph_embeddings = self.pool(node_embeddings, batched_data.batch)
        out = self.linear(graph_embeddings)

        return out

In [32]:
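def train(model, device, data_loader, optimizer, loss_fn):
    # TODO: Implement a function that trains the model using the given optimizer and loss function.
    model.train()
    loss = 0

    for step, batch in enumerate(tqdm(data_loader, desc="Iteration")):
      batch = batch.to(device)

      # Skip batches that contain only a single node or only a single graph
      if batch.x.shape[0] == 1 or batch.batch[-1] == 0:
          continue
      else:
        ## Ignore nan targets (unlabeled samples) when computing the training loss
        is_labeled = batch.y == batch.y

        ############# Your code here ############
        ## Note:
        ## 1. Zero grad the optimizer
        ## 2. Feed the data into the model
        ## 3. Use the `is_labeled` mask to filter the output and labels
        ## 4. You may need to convert the labels to torch.float32
        ## 5. Feed the output and labels to loss_fn to compute the loss
        ## (~3 lines of code)

        #########################################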
        optimizer.zero_grad()
        out = model(batch)
        labeled_out = out[is_labeled]
        labeled_y = batch.y[is_labeled].float()
        loss = loss_fn(labeled_out, labeled_y)

        loss.backward()
        optimizer.step()

    return loss.item()
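
The mask `is_labeled = batch.y == batch.y` in the training loop relies on a property of floating-point numbers: NaN is the only value that does not compare equal to itself. The plain-Python sketch below shows the same trick on hypothetical labels and logits, using lists in place of tensors.

```python
nan = float("nan")

# Hypothetical labels for five graphs; two of them are unlabeled (NaN).
y = [1.0, nan, 0.0, nan, 1.0]
logits = [0.3, -1.2, 0.8, 2.0, -0.5]

# v == v is False only when v is NaN, so this marks the labeled entries.
is_labeled = [v == v for v in y]

# Keep only the labeled targets and the model outputs that go with them.
labeled_y = [v for v, keep in zip(y, is_labeled) if keep]
labeled_logits = [z for z, keep in zip(logits, is_labeled) if keep]

print(is_labeled)      # [True, False, True, False, True]
print(labeled_y)       # [1.0, 0.0, 1.0]
print(labeled_logits)  # [0.3, 0.8, -0.5]
```

Filtering both sequences with the same mask keeps outputs and targets aligned, so the loss never sees a NaN label.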


In [33]:
# The evaluation function
def eval(model, device, loader, evaluator, save_model_results=False, save_file=None):
    model.eval()
    y_true = []
    y_pred = []

    for step, batch in enumerate(tqdm(loader, desc="Iteration")):
        batch = batch.to(device)

        if batch.x.shape[0] == 1:
            pass
        else:
            with torch.no_grad():
                pred = model(batch)

            y_true.append(batch.y.view(pred.shape).detach().cpu())
            y_pred.append(pred.detach().cpu())

    y_true = torch.cat(y_true, dim = 0).numpy()
    y_pred = torch.cat(y_pred, dim = 0).numpy()

    input_dict = {"y_true": y_true, "y_pred": y_pred}

    if save_model_results:
        print ("Saving Model Predictions")

        # Create a pandas DataFrame with two columns
        # y_pred | y_true
        data = {}
        data['y_pred'] = y_pred.reshape(-1)
        data['y_true'] = y_true.reshape(-1)

        df = pd.DataFrame(data=data)
        # Save to csv
        df.to_csv('ogbg-molhiv_graph_' + save_file + '.csv', sep=',', index=False)

    return evaluator.eval(input_dict)
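
For `ogbg-molhiv` the evaluator reports ROC-AUC, the metric named in Question 6, which is why `eval` can pass raw model outputs without thresholding them. ROC-AUC can be read as the fraction of (positive, negative) pairs in which the positive example receives the higher score. The sketch below implements that pairwise definition in plain Python. It is a simple quadratic-time illustration under that definition, not the implementation the OGB evaluator uses.

```python
def roc_auc(y_true, y_score):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half a correct pair.
    """
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical labels and raw scores; only the ordering of the scores matters.
print(roc_auc([0, 0, 1, 1], [-2.0, 0.5, 0.1, 1.5]))  # 0.75
```

Because only the ranking matters, applying a sigmoid to the logits first would not change the result.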

In [34]:
if 'IS_GRADESCOPE_ENV' not in os.environ:
  model = GCN_Graph(args['hidden_dim'],
              dataset.num_tasks, args['num_layers'],
              args['dropout']).to(device)
  evaluator = Evaluator(name='ogbg-molhiv')

In [35]:
import copy

if 'IS_GRADESCOPE_ENV' not in os.environ:
  model.reset_parameters()

  optimizer = torch.optim.Adam(model.parameters(), lr=args['lr'])
  loss_fn = torch.nn.BCEWithLogitsLoss()

  best_model = None
  best_valid_acc = 0

  for epoch in range(1, 1 + args["epochs"]):
    print('Training...')
    loss = train(model, device, train_loader, optimizer, loss_fn)

    print('Evaluating...')
    train_result = eval(model, device, train_loader, evaluator)
    val_result = eval(model, device, valid_loader, evaluator)
    test_result = eval(model, device, test_loader, evaluator)

    train_acc, valid_acc, test_acc = train_result[dataset.eval_metric], val_result[dataset.eval_metric], test_result[dataset.eval_metric]
    if valid_acc > best_valid_acc:
        best_valid_acc = valid_acc
        best_model = copy.deepcopy(model)
    print(f'Epoch: {epoch:02d}, '
          f'Loss: {loss:.4f}, '
          f'Train: {100 * train_acc:.2f}%, '
          f'Valid: {100 * valid_acc:.2f}% '
          f'Test: {100 * test_acc:.2f}%')

Training...


Iteration: 100%|██████████| 1029/1029 [00:20<00:00, 50.91it/s]


Evaluating...


Iteration: 100%|██████████| 1029/1029 [00:09<00:00, 111.11it/s]
Iteration: 100%|██████████| 129/129 [00:01<00:00, 85.83it/s]
Iteration: 100%|██████████| 129/129 [00:01<00:00, 85.62it/s]


Epoch: 01, Loss: 0.9359, Train: 70.35%, Valid: 66.61% Test: 65.21%
Training...


Iteration: 100%|██████████| 1029/1029 [00:14<00:00, 71.71it/s]


Evaluating...


Iteration: 100%|██████████| 1029/1029 [00:09<00:00, 107.99it/s]
Iteration: 100%|██████████| 129/129 [00:01<00:00, 106.99it/s]
Iteration: 100%|██████████| 129/129 [00:01<00:00, 111.40it/s]


Epoch: 02, Loss: 0.0576, Train: 74.60%, Valid: 71.81% Test: 71.85%
Training...


Iteration: 100%|██████████| 1029/1029 [00:15<00:00, 68.15it/s] 


Evaluating...


Iteration: 100%|██████████| 1029/1029 [00:09<00:00, 110.67it/s]
Iteration: 100%|██████████| 129/129 [00:01<00:00, 109.79it/s]
Iteration: 100%|██████████| 129/129 [00:01<00:00, 108.92it/s]


Epoch: 03, Loss: 0.0257, Train: 77.21%, Valid: 74.34% Test: 70.04%
Training...


Iteration: 100%|██████████| 1029/1029 [00:16<00:00, 63.69it/s]


Evaluating...


Iteration: 100%|██████████| 1029/1029 [00:08<00:00, 126.58it/s]
Iteration: 100%|██████████| 129/129 [00:01<00:00, 100.88it/s]
Iteration: 100%|██████████| 129/129 [00:01<00:00, 98.86it/s]


Epoch: 04, Loss: 1.2354, Train: 75.92%, Valid: 75.88% Test: 72.54%
Training...


Iteration: 100%|██████████| 1029/1029 [00:16<00:00, 63.96it/s]


Evaluating...


Iteration: 100%|██████████| 1029/1029 [00:08<00:00, 124.64it/s]
Iteration: 100%|██████████| 129/129 [00:01<00:00, 101.89it/s]
Iteration: 100%|██████████| 129/129 [00:01<00:00, 98.60it/s]


Epoch: 05, Loss: 0.5099, Train: 76.43%, Valid: 75.21% Test: 71.87%
Training...


Iteration: 100%|██████████| 1029/1029 [00:16<00:00, 64.17it/s]


Evaluating...


Iteration: 100%|██████████| 1029/1029 [00:09<00:00, 109.63it/s]
Iteration: 100%|██████████| 129/129 [00:01<00:00, 108.98it/s]
Iteration: 100%|██████████| 129/129 [00:00<00:00, -2823.31it/s]


Epoch: 06, Loss: 0.0197, Train: 78.51%, Valid: 75.53% Test: 71.38%
Training...


Iteration: 100%|██████████| 1029/1029 [00:16<00:00, 63.26it/s]


Evaluating...


Iteration: 100%|██████████| 1029/1029 [00:09<00:00, 108.96it/s]
Iteration: 100%|██████████| 129/129 [00:01<00:00, 105.65it/s]
Iteration: 100%|██████████| 129/129 [00:01<00:00, 107.89it/s]


Epoch: 07, Loss: 0.0310, Train: 77.97%, Valid: 75.25% Test: 72.04%
Training...


Iteration: 100%|██████████| 1029/1029 [00:14<00:00, 71.03it/s]


Evaluating...


Iteration: 100%|██████████| 1029/1029 [00:09<00:00, 107.85it/s]
Iteration: 100%|██████████| 129/129 [00:01<00:00, 107.80it/s]
Iteration: 100%|██████████| 129/129 [00:01<00:00, 101.63it/s]


Epoch: 08, Loss: 0.0315, Train: 78.57%, Valid: 75.33% Test: 72.96%
Training...


Iteration: 100%|██████████| 1029/1029 [00:14<00:00, 68.86it/s]


Evaluating...


Iteration: 100%|██████████| 1029/1029 [00:09<00:00, 107.06it/s]
Iteration: 100%|██████████| 129/129 [00:01<00:00, 98.76it/s]
Iteration: 100%|██████████| 129/129 [00:01<00:00, 103.32it/s]


Epoch: 09, Loss: 0.0460, Train: 78.23%, Valid: 76.40% Test: 74.69%
Training...


Iteration: 100%|██████████| 1029/1029 [00:14<00:00, 69.70it/s]


Evaluating...


Iteration: 100%|██████████| 1029/1029 [00:09<00:00, 109.18it/s]
Iteration: 100%|██████████| 129/129 [00:01<00:00, 108.00it/s]
Iteration: 100%|██████████| 129/129 [00:01<00:00, 111.31it/s]


Epoch: 10, Loss: 0.0856, Train: 79.14%, Valid: 74.50% Test: 74.22%
Training...


Iteration: 100%|██████████| 1029/1029 [00:16<00:00, 63.10it/s]


Evaluating...


Iteration: 100%|██████████| 1029/1029 [00:08<00:00, 125.27it/s]
Iteration: 100%|██████████| 129/129 [00:01<00:00, 105.41it/s]
Iteration: 100%|██████████| 129/129 [00:01<00:00, 106.67it/s]


Epoch: 11, Loss: 0.0274, Train: 78.69%, Valid: 76.07% Test: 69.55%
Training...


Iteration: 100%|██████████| 1029/1029 [00:16<00:00, 62.86it/s]


Evaluating...


Iteration: 100%|██████████| 1029/1029 [00:08<00:00, 126.99it/s]
Iteration: 100%|██████████| 129/129 [00:01<00:00, 107.87it/s]
Iteration: 100%|██████████| 129/129 [00:01<00:00, 105.45it/s]


Epoch: 12, Loss: 0.0243, Train: 80.08%, Valid: 73.80% Test: 74.42%
Training...


Iteration: 100%|██████████| 1029/1029 [00:15<00:00, 65.49it/s]


Evaluating...


Iteration: 100%|██████████| 1029/1029 [00:09<00:00, 103.85it/s]
Iteration: 100%|██████████| 129/129 [00:00<00:00, -1005.36it/s]
Iteration: 100%|██████████| 129/129 [00:01<00:00, 108.35it/s]


Epoch: 13, Loss: 0.0521, Train: 81.35%, Valid: 76.51% Test: 73.07%
Training...


Iteration: 100%|██████████| 1029/1029 [00:15<00:00, 65.28it/s]


Evaluating...


Iteration: 100%|██████████| 1029/1029 [00:09<00:00, 109.75it/s]
Iteration: 100%|██████████| 129/129 [00:01<00:00, 113.20it/s]
Iteration: 100%|██████████| 129/129 [00:01<00:00, 112.34it/s]


Epoch: 14, Loss: 0.0282, Train: 81.23%, Valid: 74.40% Test: 73.22%
Training...


Iteration: 100%|██████████| 1029/1029 [00:16<00:00, 63.21it/s]


Evaluating...


Iteration: 100%|██████████| 1029/1029 [00:13<00:00, 76.38it/s]
Iteration: 100%|██████████| 129/129 [00:01<00:00, 71.60it/s]
Iteration: 100%|██████████| 129/129 [00:01<00:00, 78.07it/s]


Epoch: 15, Loss: 0.0283, Train: 80.68%, Valid: 76.71% Test: 72.62%
Training...


Iteration: 100%|██████████| 1029/1029 [00:18<00:00, 56.70it/s]


Evaluating...


Iteration: 100%|██████████| 1029/1029 [00:09<00:00, 103.45it/s]
Iteration: 100%|██████████| 129/129 [00:01<00:00, 98.43it/s]
Iteration: 100%|██████████| 129/129 [00:01<00:00, 109.37it/s]


Epoch: 16, Loss: 0.0506, Train: 80.66%, Valid: 75.73% Test: 72.06%
Training...


Iteration: 100%|██████████| 1029/1029 [00:14<00:00, 70.94it/s]


Evaluating...


Iteration: 100%|██████████| 1029/1029 [00:13<00:00, 75.28it/s]
Iteration: 100%|██████████| 129/129 [00:01<00:00, 81.86it/s]
Iteration: 100%|██████████| 129/129 [00:01<00:00, 83.13it/s]


Epoch: 17, Loss: 0.0634, Train: 81.16%, Valid: 76.88% Test: 74.58%
Training...


Iteration: 100%|██████████| 1029/1029 [00:19<00:00, 54.15it/s]


Evaluating...


Iteration: 100%|██████████| 1029/1029 [00:13<00:00, 77.98it/s]
Iteration: 100%|██████████| 129/129 [00:00<00:00, 448.10it/s]
Iteration: 100%|██████████| 129/129 [00:01<00:00, 75.05it/s]


Epoch: 18, Loss: 0.0359, Train: 82.18%, Valid: 76.35% Test: 72.52%
Training...


Iteration: 100%|██████████| 1029/1029 [00:18<00:00, 56.54it/s]


Evaluating...


Iteration: 100%|██████████| 1029/1029 [00:09<00:00, 105.14it/s]
Iteration: 100%|██████████| 129/129 [00:01<00:00, 97.76it/s] 
Iteration: 100%|██████████| 129/129 [00:00<00:00, -18095.22it/s]


Epoch: 19, Loss: 0.0348, Train: 82.13%, Valid: 75.62% Test: 74.36%
Training...


Iteration: 100%|██████████| 1029/1029 [00:16<00:00, 62.90it/s]


Evaluating...


Iteration: 100%|██████████| 1029/1029 [00:09<00:00, 105.76it/s]
Iteration: 100%|██████████| 129/129 [00:01<00:00, 110.57it/s]
Iteration: 100%|██████████| 129/129 [00:01<00:00, 110.72it/s]


Epoch: 20, Loss: 0.0232, Train: 82.21%, Valid: 75.83% Test: 71.59%
Training...


Iteration: 100%|██████████| 1029/1029 [00:14<00:00, 71.59it/s]


Evaluating...


Iteration: 100%|██████████| 1029/1029 [00:09<00:00, 110.54it/s]
Iteration: 100%|██████████| 129/129 [00:01<00:00, 111.02it/s]
Iteration: 100%|██████████| 129/129 [00:01<00:00, 110.77it/s]


Epoch: 21, Loss: 0.0334, Train: 82.49%, Valid: 78.07% Test: 73.92%
Training...


Iteration: 100%|██████████| 1029/1029 [00:14<00:00, 69.21it/s]


Evaluating...


Iteration: 100%|██████████| 1029/1029 [00:09<00:00, 107.17it/s]
Iteration: 100%|██████████| 129/129 [00:01<00:00, 108.04it/s]
Iteration: 100%|██████████| 129/129 [00:01<00:00, 108.51it/s]


Epoch: 22, Loss: 0.0305, Train: 82.92%, Valid: 75.59% Test: 73.51%
Training...


Iteration: 100%|██████████| 1029/1029 [00:14<00:00, 70.50it/s]


Evaluating...


Iteration: 100%|██████████| 1029/1029 [00:09<00:00, 107.86it/s]
Iteration: 100%|██████████| 129/129 [00:01<00:00, 101.72it/s]
Iteration: 100%|██████████| 129/129 [00:01<00:00, 110.85it/s]


Epoch: 23, Loss: 0.0509, Train: 82.36%, Valid: 77.03% Test: 74.67%
Training...


Iteration: 100%|██████████| 1029/1029 [00:15<00:00, 64.94it/s]


Evaluating...


Iteration: 100%|██████████| 1029/1029 [00:08<00:00, 126.15it/s]
Iteration: 100%|██████████| 129/129 [00:01<00:00, 106.61it/s]
Iteration: 100%|██████████| 129/129 [00:01<00:00, 106.39it/s]


Epoch: 24, Loss: 0.0301, Train: 82.55%, Valid: 76.97% Test: 75.07%
Training...


Iteration: 100%|██████████| 1029/1029 [00:16<00:00, 62.73it/s]


Evaluating...


Iteration: 100%|██████████| 1029/1029 [00:08<00:00, 122.90it/s]
Iteration: 100%|██████████| 129/129 [00:01<00:00, 108.05it/s]
Iteration: 100%|██████████| 129/129 [00:01<00:00, 112.10it/s]


Epoch: 25, Loss: 0.0196, Train: 83.17%, Valid: 76.94% Test: 73.73%
Training...


Iteration: 100%|██████████| 1029/1029 [00:15<00:00, 65.37it/s]


Evaluating...


Iteration: 100%|██████████| 1029/1029 [00:09<00:00, 107.92it/s]
Iteration: 100%|██████████| 129/129 [00:00<00:00, -1666.42it/s]
Iteration: 100%|██████████| 129/129 [00:01<00:00, 108.52it/s]


Epoch: 26, Loss: 0.0195, Train: 83.28%, Valid: 74.85% Test: 74.43%
Training...


Iteration: 100%|██████████| 1029/1029 [00:15<00:00, 64.33it/s]


Evaluating...


Iteration: 100%|██████████| 1029/1029 [00:09<00:00, 107.70it/s]
Iteration: 100%|██████████| 129/129 [00:01<00:00, 107.53it/s]
Iteration: 100%|██████████| 129/129 [00:01<00:00, 92.70it/s]


Epoch: 27, Loss: 0.0211, Train: 82.98%, Valid: 80.02% Test: 74.91%
Training...


Iteration: 100%|██████████| 1029/1029 [00:15<00:00, 68.24it/s]


Evaluating...


Iteration: 100%|██████████| 1029/1029 [00:09<00:00, 106.91it/s]
Iteration: 100%|██████████| 129/129 [00:01<00:00, 99.97it/s] 
Iteration: 100%|██████████| 129/129 [00:01<00:00, 107.93it/s]


Epoch: 28, Loss: 0.0268, Train: 83.60%, Valid: 79.03% Test: 75.17%
Training...


Iteration: 100%|██████████| 1029/1029 [00:14<00:00, 69.62it/s]


Evaluating...


Iteration: 100%|██████████| 1029/1029 [00:10<00:00, 102.05it/s]
Iteration: 100%|██████████| 129/129 [00:01<00:00, 107.31it/s]
Iteration: 100%|██████████| 129/129 [00:01<00:00, 108.23it/s]


Epoch: 29, Loss: 0.0379, Train: 83.72%, Valid: 78.19% Test: 74.74%
Training...


Iteration: 100%|██████████| 1029/1029 [00:14<00:00, 68.75it/s]


Evaluating...


Iteration: 100%|██████████| 1029/1029 [00:10<00:00, 101.94it/s]
Iteration: 100%|██████████| 129/129 [00:01<00:00, 111.65it/s]
Iteration: 100%|██████████| 129/129 [00:01<00:00, 109.16it/s]

Epoch: 30, Loss: 0.0251, Train: 83.11%, Valid: 78.58% Test: 74.38%





### Question 6: What are your best model's validation and test ROC-AUC scores?

Run the cell below to see your best model's results
and to save its predictions to two files: `ogbg-molhiv_graph_valid.csv` and `ogbg-molhiv_graph_test.csv`.

In [36]:
if 'IS_GRADESCOPE_ENV' not in os.environ:
  train_auroc = eval(best_model, device, train_loader, evaluator)[dataset.eval_metric]
  valid_auroc = eval(best_model, device, valid_loader, evaluator, save_model_results=True, save_file="valid")[dataset.eval_metric]
  test_auroc  = eval(best_model, device, test_loader, evaluator, save_model_results=True, save_file="test")[dataset.eval_metric]

  print(f'Best model: '
      f'Train: {100 * train_auroc:.2f}%, '
      f'Valid: {100 * valid_auroc:.2f}% '
      f'Test: {100 * test_auroc:.2f}%')

Iteration: 100%|██████████| 1029/1029 [00:19<00:00, 53.81it/s] 
Iteration: 100%|██████████| 129/129 [00:01<00:00, 90.40it/s]


Saving Model Predictions


Iteration: 100%|██████████| 129/129 [00:01<00:00, 97.93it/s]

Saving Model Predictions
Best model: Train: 82.98%, Valid: 80.02% Test: 74.91%
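
The numbers reported for the best model come from the epoch with the highest validation score, not from the final epoch. The sketch below replays that selection rule on the last five epochs of the training log above (the values are copied from the log; the `history` list itself is just an illustrative container).

```python
# (epoch, valid %, test %) copied from epochs 26-30 of the training log above.
history = [
    (26, 74.85, 74.43),
    (27, 80.02, 74.91),
    (28, 79.03, 75.17),
    (29, 78.19, 74.74),
    (30, 78.58, 74.38),
]

best = None
for epoch, valid, test in history:
    # Mirror the training loop's rule: keep a checkpoint only on a strict improvement.
    if best is None or valid > best[1]:
        best = (epoch, valid, test)

print(best)  # (27, 80.02, 74.91)
```

Epoch 28 has a slightly higher test score, but it is not selected: choosing a checkpoint by its test score would leak test information into model selection.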





### Question 7 (optional): Try two other global pooling methods available in PyG

In [None]:
############# Your code here ############