# Example of loading datasets in GRB

[GRB](https://cogdl.ai/grb/home) provides its own datasets at several scales with dedicated preprocessing, and also supports external datasets from [CogDL](https://cogdl.ai/) and [OGB](https://ogb.stanford.edu/). All datasets are downloaded automatically when loaded as in the examples below. If the automatic download fails, you can also download them manually from this [link](https://cloud.tsinghua.edu.cn/d/c77db90e05e74a5c9b8b/).

Contents
- [GRB Datasets](#GRB-Datasets)
- [CogDL Datasets](#CogDL-Datasets)
- [OGB Datasets](#OGB-Datasets)
- [Prepare Dataset](#Prepare-Dataset)

In [1]:
import os
from grb.dataset import Dataset, CogDLDataset, OGBDataset

data_dir = "../data/"

## GRB Datasets

GRB datasets are named with the prefix *grb-*. The test set has four *modes* ('easy', 'medium', 'hard', 'full'), which correspond to different average degrees of the test nodes and therefore to different levels of difficulty for attacking them. Node features are processed by *arctan* normalization (standardization followed by the arctan function), which brings all node features onto the same scale.
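
To make the *arctan* normalization concrete, here is a minimal sketch on a flat list of values, using only the Python standard library. It is an illustration rather than GRB's own code: the helper name `arctan_normalize`, the use of a single global mean and standard deviation, and the `2 / pi` scale factor (which maps the result into the open interval (-1, 1)) are all assumptions.

```python
import math
import statistics


def arctan_normalize(values):
    """Standardize values, then squash them into (-1, 1) with arctan.

    Assumption: global mean/std standardization and a 2/pi scale factor;
    GRB's own implementation may differ in detail.
    """
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    if std == 0:
        # All values identical: standardization is undefined, map to 0.
        return [0.0 for _ in values]
    return [2.0 * math.atan((v - mean) / std) / math.pi for v in values]


raw = [0.0, 1.0, 2.0, 50.0, 270.0]  # made-up values with a heavy tail
normalized = arctan_normalize(raw)
print([round(x, 4) for x in normalized])
```

Because arctan is bounded and monotonic, large outliers are compressed toward the ends of the interval while the ordering of the values is preserved. This is consistent with the feature ranges printed for the *grb-* datasets below, which all lie strictly inside [-1, 1].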

### grb-cora

In [2]:
dataset = Dataset(name="grb-cora", 
                  data_dir=data_dir, 
                  mode='full', 
                  feat_norm="arctan")

Dataset 'grb-cora' loaded.
    Number of nodes: 2680
    Number of edges: 5148
    Number of features: 302
    Number of classes: 7
    Number of train samples: 1608
    Number of val samples: 268
    Number of test samples: 804
    Dataset mode: full
    Feature range: [-0.9406, 0.9430]


### grb-citeseer

In [3]:
dataset = Dataset(name="grb-citeseer", 
                  data_dir=data_dir, 
                  mode='full', 
                  feat_norm="arctan")

Dataset 'grb-citeseer' loaded.
    Number of nodes: 3191
    Number of edges: 4172
    Number of features: 768
    Number of classes: 6
    Number of train samples: 1914
    Number of val samples: 320
    Number of test samples: 957
    Dataset mode: full
    Feature range: [-0.9585, 0.8887]


### grb-flickr

In [4]:
dataset = Dataset(name="grb-flickr", 
                  data_dir=data_dir, 
                  mode='full', 
                  feat_norm="arctan")

Dataset 'grb-flickr' loaded.
    Number of nodes: 89250
    Number of edges: 449878
    Number of features: 500
    Number of classes: 7
    Number of train samples: 53550
    Number of val samples: 8925
    Number of test samples: 26775
    Dataset mode: full
    Feature range: [-0.4665, 0.9976]


### grb-reddit

In [5]:
dataset = Dataset(name="grb-reddit", 
                  data_dir=data_dir, 
                  mode='full', 
                  feat_norm="arctan")

Dataset 'grb-reddit' loaded.
    Number of nodes: 232965
    Number of edges: 11606919
    Number of features: 602
    Number of classes: 41
    Number of train samples: 139779
    Number of val samples: 23298
    Number of test samples: 69888
    Dataset mode: full
    Feature range: [-0.9774, 0.9947]


### grb-aminer

In [6]:
dataset = Dataset(name="grb-aminer", 
                  data_dir=data_dir, 
                  mode='full', 
                  feat_norm="arctan")

Dataset 'grb-aminer' loaded.
    Number of nodes: 659574
    Number of edges: 2878577
    Number of features: 100
    Number of classes: 18
    Number of train samples: 395744
    Number of val samples: 65959
    Number of test samples: 197871
    Dataset mode: full
    Feature range: [-0.9326, 0.9290]


## CogDL Datasets

### Cora

In [7]:
dataset = CogDLDataset(name="cora", data_dir=data_dir)

Dataset 'cora' loaded.
    Number of nodes: 2708
    Number of edges: 5092
    Number of features: 1433
    Number of classes: 7
    Number of train samples: 140
    Number of val samples: 500
    Number of test samples: 1000
    Feature range: [0.0000, 1.0000]


### Citeseer

In [8]:
dataset = CogDLDataset(name="citeseer", data_dir=data_dir)

Dataset 'citeseer' loaded.
    Number of nodes: 3327
    Number of edges: 4552
    Number of features: 3703
    Number of classes: 6
    Number of train samples: 120
    Number of val samples: 500
    Number of test samples: 1000
    Feature range: [0.0000, 0.1250]


### Pubmed

In [9]:
dataset = CogDLDataset(name="pubmed", data_dir=data_dir)

Dataset 'pubmed' loaded.
    Number of nodes: 19717
    Number of edges: 44324
    Number of features: 500
    Number of classes: 3
    Number of train samples: 60
    Number of val samples: 500
    Number of test samples: 1000
    Feature range: [0.0000, 0.4862]


### Flickr

In [10]:
dataset = CogDLDataset(name="flickr", data_dir=data_dir)

Dataset 'flickr' loaded.
    Number of nodes: 89250
    Number of edges: 449878
    Number of features: 500
    Number of classes: 7
    Number of train samples: 44625
    Number of val samples: 22312
    Number of test samples: 22313
    Feature range: [-0.8998, 269.9578]


### Reddit

In [11]:
dataset = CogDLDataset(name="reddit", data_dir=data_dir)

Dataset 'reddit' loaded.
    Number of nodes: 232965
    Number of edges: 11606919
    Number of features: 602
    Number of classes: 41
    Number of train samples: 153932
    Number of val samples: 23699
    Number of test samples: 55334
    Feature range: [-28.1936, 120.9568]


## OGB Datasets

### ogbn-arxiv

In [12]:
dataset = OGBDataset(name="ogbn-arxiv", data_dir=data_dir)

Dataset 'ogbn-arxiv' loaded.
    Number of nodes: 169343
    Number of edges: 1166243
    Number of features: 128
    Number of classes: 40
    Number of train samples: 90941
    Number of val samples: 29799
    Number of test samples: 48603
    Feature range: [-1.3889, 1.6387]


In [13]:
dataset_name = "ogbn-arxiv"
dataset = OGBDataset(name=dataset_name, data_dir="../../../dataset/")

Dataset 'ogbn-arxiv' loaded.
    Number of nodes: 169343
    Number of edges: 1166243
    Number of features: 128
    Number of classes: 40
    Number of train samples: 90941
    Number of val samples: 29799
    Number of test samples: 48603
    Feature range: [-1.3889, 1.6387]


### ogbn-products

In [14]:
dataset = OGBDataset(name="ogbn-products", data_dir=data_dir)

Dataset 'ogbn-products' loaded.
    Number of nodes: 2449029
    Number of edges: 61859140
    Number of features: 100
    Number of classes: 47
    Number of train samples: 196615
    Number of val samples: 39323
    Number of test samples: 2213091
    Feature range: [-1434.0566, 904.9496]


### ogbn-proteins

In [15]:
dataset = OGBDataset(name="ogbn-proteins", data_dir=data_dir)

Dataset 'ogbn-proteins' loaded.
    Number of nodes: 132534
    Number of edges: 39561252
    Number of features: 8
    Number of classes: 2
    Number of tasks: 112
    Number of train samples: 86619
    Number of val samples: 21236
    Number of test samples: 24679
    Feature range: [0.0010, 1.0000]


## Prepare Dataset

In [16]:
adj = dataset.adj
features = dataset.features
labels = dataset.labels
num_nodes = dataset.num_nodes
num_features = dataset.num_features
num_classes = dataset.num_classes
train_mask = dataset.train_mask
val_mask = dataset.val_mask
test_mask = dataset.test_mask
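
As a quick sanity check on the masks, the sketch below counts the samples in each split and verifies that no node is assigned to more than one split. The helper `summarize_split` is hypothetical, not part of GRB, and it assumes each mask is a sequence of booleans with one entry per node. If the masks are tensors or arrays, converting them first with `.tolist()` should work, but that is also an assumption about their type.

```python
def summarize_split(train_mask, val_mask, test_mask):
    """Return split sizes and whether the three masks are disjoint.

    Each mask is a sequence of booleans with one entry per node.
    """
    if not (len(train_mask) == len(val_mask) == len(test_mask)):
        raise ValueError("all masks must have one entry per node")
    sizes = {
        "train": sum(train_mask),
        "val": sum(val_mask),
        "test": sum(test_mask),
    }
    # A node belongs to at most one split if no position is True twice.
    disjoint = all(
        (tr + va + te) <= 1
        for tr, va, te in zip(train_mask, val_mask, test_mask)
    )
    return sizes, disjoint


# Toy masks for a 10-node graph (made-up data, 6/1/3 split).
train = [True] * 6 + [False] * 4
val = [False] * 6 + [True] + [False] * 3
test = [False] * 7 + [True] * 3
sizes, disjoint = summarize_split(train, val, test)
print(sizes, disjoint)
```

The split sizes do not have to add up to the number of nodes. In the outputs above, grb-cora reports 1608 + 268 + 804 = 2680 samples, equal to its node count, whereas the CogDL version of Cora labels only 140 + 500 + 1000 of its 2708 nodes.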