# II. Graph Dataset Loading and Analysis


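| Dataset       | Type                          | Description                                      | Nodes     | Edges      | Classes |
|---------------|-------------------------------|--------------------------------------------------|-----------|------------|---------|
| Cora          | Homogeneous graph             | Citation network; predict the paper category     | 2,708     | 5,429      | 7       |
| CiteSeer      | Homogeneous graph             | Citation network; predict the paper category     | 3,312     | 4,732      | 6       |
| PubMed        | Homogeneous graph             | Biomedical citation network                      | 19,717    | 44,338     | 3       |
| Reddit        | Large-scale homogeneous graph | Graph of relations between Reddit posts          | 232,965   | 11,606,919 | 41      |
| ogbn-arxiv    | Large-scale homogeneous graph | arXiv paper citation network                     | 169,343   | 1,166,243  | 40      |
| ogbn-products | Industry-scale homogeneous graph | Amazon product co-purchasing graph            | 2,449,029 | 61,859,140 | 47      |

Node and edge counts are the figures commonly reported for each dataset. The versions loaded through PyG can differ: the output in section 3 shows 10,556 edges for Cora, where each undirected edge is stored once per direction, and 3,327 nodes for CiteSeer.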

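## 1. Environment Setup

Set the Hugging Face mirror endpoint to work around network problems when accessing Hugging Face from mainland China, so that pretrained models can be downloaded and used normally.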

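In [1]:
```python
# Point Hugging Face downloads at the mirror
import os
os.environ['HF_ENDPOINT'] = 'https://hf-mirror.com'
print("Hugging Face mirror endpoint set.")
```

Hugging Face mirror endpoint set.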
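
If the endpoint may already be configured in the shell, overwriting it unconditionally can be undesirable. The following is a minimal sketch, assuming the same mirror URL as in the cell above; `MIRROR` and `endpoint` are illustrative names. The variable should be set before any Hugging Face library is imported so that the library picks it up.

```python
import os

MIRROR = "https://hf-mirror.com"  # same mirror URL as in the cell above

# setdefault only writes the value when HF_ENDPOINT is not already set,
# so a value exported in the shell takes precedence.
endpoint = os.environ.setdefault("HF_ENDPOINT", MIRROR)
print(f"HF_ENDPOINT = {endpoint}")
```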


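## 2. Load the Datasets and Save Them Locally

The Cora, CiteSeer, PubMed, and Reddit datasets are loaded with the PyTorch Geometric (PyG) library; ogbn-arxiv and ogbn-products are loaded with the Open Graph Benchmark (OGB) library.

All of these datasets are suited to node classification tasks.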

In [2]:
```python
from torch_geometric.datasets import Planetoid, Reddit  # Cora, CiteSeer, PubMed, Reddit
from ogb.nodeproppred import PygNodePropPredDataset  # ogbn-arxiv, ogbn-products

# If torch.load() raises an unpickling error (it defaults to weights_only=True
# since PyTorch 2.6), register the PyG classes below as safe globals.
import torch
from torch_geometric.data.data import DataEdgeAttr, DataTensorAttr
from torch_geometric.data.storage import GlobalStorage

torch.serialization.add_safe_globals([
    DataEdgeAttr,
    DataTensorAttr,
    GlobalStorage
    # add any other classes that torch.load() reports as unsupported
])

# Cora, CiteSeer, PubMed
cora_dataset = Planetoid(root='../datasets', name='Cora')
citeseer_dataset = Planetoid(root='../datasets', name='CiteSeer')
pubmed_dataset = Planetoid(root='../datasets', name='PubMed')
print("Planetoid datasets loaded successfully.")

# Reddit
reddit_dataset = Reddit(root='../datasets/Reddit')
print("Reddit dataset loaded successfully.")

# ogbn-arxiv, ogbn-products
ogbn_arxiv_dataset = PygNodePropPredDataset(name='ogbn-arxiv', root='../datasets')
ogbn_products_dataset = PygNodePropPredDataset(name='ogbn-products', root='../datasets')
print("OGB datasets loaded successfully.")

print("All datasets downloaded.")
```

Planetoid datasets loaded successfully.
Reddit dataset loaded successfully.
OGB datasets loaded successfully.
All datasets downloaded.


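## 3. Dataset Analysis

The statistics of each dataset are analyzed one by one.

In the loop below, `data = i[0]` (that is, `dataset[0]`) is the first graph of each dataset. For Cora it prints:

The first graph: Data(x=[2708, 1433], edge_index=[2, 10556], y=[2708], train_mask=[2708], val_mask=[2708], test_mask=[2708])

Meaning of the attributes:
- **x**: node feature matrix, shape [2708 nodes × 1433 feature dimensions]
- **edge_index**: edge list, shape [2 × 10556 edges] - the two rows hold the source and target node of each edge, so every column is one pair of connected nodes
- **y**: node labels, shape [2708] - the class of each node
- **train_mask/val_mask/test_mask**: training/validation/test split, each of shape [2708] - boolean masks
- The masks **split the dataset**: they tell the model which nodes to use for training, validation, and testing. `False` (0) means the node is not part of that split, `True` (1) means it is and its label is used there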
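
The layout described above can be illustrated on a toy graph. This is a minimal sketch that uses plain Python lists in place of the PyG tensors, so it is not how PyG computes these values internally; all names and the four-node graph are made up for illustration. It also computes by hand the structural statistics that the analysis loop in this section prints.

```python
# Toy graph with 4 nodes; plain lists stand in for the tensors.
# Undirected edges 0-1 and 1-2 are stored in both directions; node 3 is isolated.
edge_index = [
    [0, 1, 1, 2],  # source nodes
    [1, 0, 2, 1],  # target nodes
]
y = [0, 1, 0, 1]                          # one class label per node
train_mask = [True, False, False, False]  # node 0 is used for training
val_mask = [False, True, False, False]    # node 1 is used for validation
test_mask = [False, False, True, False]   # node 2 is used for testing

num_nodes = len(y)
num_edges = len(edge_index[0])            # each direction counts separately
edges = set(zip(edge_index[0], edge_index[1]))

# Labels visible during training: only nodes where train_mask is True.
train_labels = [label for label, keep in zip(y, train_mask) if keep]

# Structural statistics, computed by hand.
avg_degree = num_edges / num_nodes
# A node counts as connected if it appears in at least one non-self-loop edge.
connected = {n for src, dst in edges if src != dst for n in (src, dst)}
has_isolated_nodes = len(connected) < num_nodes
has_self_loops = any(src == dst for src, dst in edges)
is_undirected = all((dst, src) in edges for src, dst in edges)

print(f"train labels: {train_labels}")
print(f"train ratio: {sum(train_mask) / num_nodes:.2%}")
print(f"average degree: {avg_degree:.2f}")
print(f"isolated: {has_isolated_nodes}, self-loops: {has_self_loops}, "
      f"undirected: {is_undirected}")
```

The same arithmetic applied to the Cora counts gives 10556 / 2708 ≈ 3.90 for the average degree and 140 / 2708 ≈ 5.17% for the training ratio.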

In [3]:
```python
for i in [cora_dataset, citeseer_dataset, pubmed_dataset, reddit_dataset, ogbn_arxiv_dataset, ogbn_products_dataset]:
    print(f'===== Dataset: {i} ' + '=' * 100)
    # Dataset-level statistics
    print(f'Number of graphs: {len(i)}')
    print(f'Number of features: {i.num_features}')
    print(f'Number of classes: {i.num_classes}')
    print()

    # Statistics of the first graph
    data = i[0]  # the first graph in the dataset
    print(f"The first graph: {data}")
    print(f'Number of nodes: {data.num_nodes}')
    print(f'Number of edges: {data.num_edges}')
    print(f'Average node degree: {data.num_edges / data.num_nodes:.2f}')
    print(f'Has isolated nodes: {data.has_isolated_nodes()}')
    print(f'Has self-loops: {data.has_self_loops()}')
    print(f'Is undirected: {data.is_undirected()}')

    if i in [cora_dataset, citeseer_dataset, pubmed_dataset, reddit_dataset]:
        # PyG datasets store the split as boolean masks on the graph
        print(f'Number of training nodes: {data.train_mask.sum().item()}')
        print(f'Number of validation nodes: {data.val_mask.sum().item()}')
        print(f'Number of testing nodes: {data.test_mask.sum().item()}')
        print(f'Training node ratio: {data.train_mask.sum().item() / data.num_nodes:.2%}')
        print(f'Validation node ratio: {data.val_mask.sum().item() / data.num_nodes:.2%}')
        print(f'Testing node ratio: {data.test_mask.sum().item() / data.num_nodes:.2%}')
    else:
        # OGB datasets expose the split separately, as node index tensors
        split_idx = i.get_idx_split()
        train_idx = split_idx['train']
        valid_idx = split_idx['valid']
        test_idx = split_idx['test']
        print(f'Number of training nodes: {len(train_idx)}')
        print(f'Number of validation nodes: {len(valid_idx)}')
        print(f'Number of testing nodes: {len(test_idx)}')
        print(f'Training node label rate: {len(train_idx) / data.num_nodes:.2f}')
        print(f'Validation node label rate: {len(valid_idx) / data.num_nodes:.2f}')
        print(f'Testing node label rate: {len(test_idx) / data.num_nodes:.2f}')

    print('=' * 100)
    print()
    print()
```


Number of graphs: 1
Number of features: 1433
Number of classes: 7

The first graph: Data(x=[2708, 1433], edge_index=[2, 10556], y=[2708], train_mask=[2708], val_mask=[2708], test_mask=[2708])
Number of nodes: 2708
Number of edges: 10556
Average node degree: 3.90
Has isolated nodes: False
Has self-loops: False
Is undirected: True
Number of training nodes: 140
Number of validation nodes: 500
Number of testing nodes: 1000
Training node ratio: 5.17%
Validation node ratio: 18.46%
Testing node ratio: 36.93%


Number of graphs: 1
Number of features: 3703
Number of classes: 6

The first graph: Data(x=[3327, 3703], edge_index=[2, 9104], y=[3327], train_mask=[3327], val_mask=[3327], test_mask=[3327])
Number of nodes: 3327
Number of edges: 9104
Average node degree: 2.74
Has isolated nodes: True
Has self-loops: False
Is undirected: True
Number of training nodes: 120
Number of validation nodes: 500
Number of testing nodes: 1000
Training node ratio: 3.61%
Validation node ratio: 15.03%
Testing node r