# CORA

The Cora dataset consists of 2708 scientific publications classified into one of seven classes. The citation network consists of 5429 links. Each publication in the dataset is described by a 0/1-valued word vector indicating the absence/presence of the corresponding word from the dictionary. The dictionary consists of 1433 unique words.

Original source: [web.archive.org](https://web.archive.org/web/20151007064508/http://linqs.cs.umd.edu/projects/projects/lbc/)

In [1]:
from torch_geometric.datasets import Planetoid

In [2]:
dataset = Planetoid(root='/tmp/cora', name='Cora')

In [3]:
dataset

Cora()

In [4]:
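# In PyG 2.x, DataLoader lives in torch_geometric.loader;
# the torch_geometric.data import path is deprecated.
from torch_geometric.loader import DataLoader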

In [5]:
loader = DataLoader(dataset, batch_size=32, shuffle=False)



In [6]:
# Count the batches the loader yields; `batch` stays bound to the last one.
batch_i = 0
for batch in loader:
    print(batch)
    batch_i += 1

print(batch_i)

DataBatch(x=[2708, 1433], edge_index=[2, 10556], y=[2708], train_mask=[2708], val_mask=[2708], test_mask=[2708], batch=[2708], ptr=[2])
1
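
The loop yields a single batch because the loader groups whole graphs, not nodes, and Cora is stored as one graph. A minimal sketch of that arithmetic, assuming the loader keeps the final partial batch; the helper name `num_batches` is hypothetical and not part of PyG:

```python
import math


def num_batches(num_graphs: int, batch_size: int) -> int:
    """Number of batches when the last, possibly smaller, batch is kept."""
    return math.ceil(num_graphs / batch_size)


# One graph with batch_size=32 still fits in exactly one batch.
print(num_batches(1, 32))
```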


In [7]:
type(batch)

torch_geometric.data.batch.DataBatch

In [8]:
data = dataset[0]

print(f'Dataset: {dataset}:')
print('======================')
print(f'Number of graphs: {len(dataset)}')
print(f'Number of features: {dataset.num_features}')
print(f'Number of classes: {dataset.num_classes}')

print('==============================================================')

# Gather some statistics about the graph.
print(f'Number of nodes: {data.num_nodes}')
print(f'Number of node features: {data.num_node_features}')
print(f'Number of edges: {data.num_edges}')
print(f'Number of edge features: {data.num_edge_features}')
# edge_index stores every undirected edge in both directions, so num_edges
# already counts each link twice; no extra factor of 2 is needed.
print(f'Average node degree: {data.num_edges / data.num_nodes:.2f}')

print("============= split ==========")

print(f'Number of training nodes: {data.train_mask.sum()}')
print(f'Training node label rate: {int(data.train_mask.sum()) / data.num_nodes:.2f}')
print(f'Number of validation nodes: {data.val_mask.sum()}')
print(f'validation node label rate: {int(data.val_mask.sum()) / data.num_nodes:.2f}')
print(f'Number of test nodes: {data.test_mask.sum()}')
print(f'test node label rate: {int(data.test_mask.sum()) / data.num_nodes:.2f}')

print("============ properties ===========")
print(f'Contains isolated nodes: {data.has_isolated_nodes()}')
print(f'Contains self-loops: {data.has_self_loops()}')
print(f'Is undirected: {data.is_undirected()}')
print(f'Is directed: {data.is_directed()}')


Dataset: Cora():
Number of graphs: 1
Number of features: 1433
Number of classes: 7
Number of nodes: 2708
Number of node features: 1433
Number of edges: 10556
Number of edge features: 0
Average node degree: 3.90
Number of training nodes: 140
Training node label rate: 0.05
Number of validation nodes: 500
validation node label rate: 0.18
Number of test nodes: 1000
test node label rate: 0.37
Contains isolated nodes: False
Contains self-loops: False
Is undirected: True
Is directed: False
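
The label rates printed above are each split's node count divided by the total number of nodes. The arithmetic can be reproduced without PyG. The counts below are copied from the printed output, the variable names are illustrative only, and the last figure assumes the three masks do not overlap:

```python
num_nodes = 2708
split_sizes = {"train": 140, "val": 500, "test": 1000}

# Fraction of all nodes that falls in each split.
label_rates = {name: size / num_nodes for name, size in split_sizes.items()}
for name, rate in label_rates.items():
    print(f"{name}: {rate:.2f}")

# Nodes covered by none of the three masks (assumes the masks are disjoint).
unassigned = num_nodes - sum(split_sizes.values())
print(f"unassigned: {unassigned}")
```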


In [9]:
data

Data(x=[2708, 1433], edge_index=[2, 10556], y=[2708], train_mask=[2708], val_mask=[2708], test_mask=[2708])

`x=[2708, 1433]`: the node feature matrix, with shape `[num_nodes, num_node_features]`. Each row describes one publication as a 0/1-valued word vector indicating the absence/presence of the corresponding word from the 1433-word dictionary.
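
To make the 0/1 encoding concrete, here is a small sketch in plain Python. The five-word vocabulary, the sample paper, and the helper `binary_bag_of_words` are invented for illustration; they are not taken from Cora or from PyG:

```python
def binary_bag_of_words(words, vocabulary):
    """Return a 0/1 vector with one entry per vocabulary word."""
    present = set(words)
    return [1 if term in present else 0 for term in vocabulary]


# Toy 5-word dictionary; Cora's real dictionary has 1433 words.
vocabulary = ["graph", "neural", "network", "genetic", "learning"]
paper = ["neural", "network", "learning", "learning"]

# Repeated words still map to 1: the vector records presence, not counts.
print(binary_bag_of_words(paper, vocabulary))  # [0, 1, 1, 0, 1]
```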

`edge_index=[2, 10556]`: graph connectivity in COO format, with shape `[2, num_edges]` and type `torch.long`. If `edge_index` is a `torch.LongTensor`, its shape must be `[2, num_messages]`, where messages from nodes in `edge_index[0]` are sent to nodes in `edge_index[1]` ([source](https://github.com/pyg-team/pytorch_geometric/blob/master/torch_geometric/nn/conv/message_passing.py)). Because the graph is undirected and has no self-loops, every link is stored once in each direction, so the 10556 columns correspond to 5278 undirected links.
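
The two-row COO layout can be illustrated with plain Python lists. This is a sketch on a toy graph, not PyG code; the helper `to_coo` is hypothetical and simply mirrors the convention described above, where row 0 holds source nodes and row 1 holds target nodes:

```python
def to_coo(undirected_edges):
    """Store each undirected edge (u, v) as two directed edges u->v and v->u."""
    sources, targets = [], []
    for u, v in undirected_edges:
        sources += [u, v]
        targets += [v, u]
    return [sources, targets]


# Toy graph with 3 nodes and 2 undirected links: 0-1 and 1-2.
edge_index = to_coo([(0, 1), (1, 2)])
print(edge_index)  # [[0, 1, 1, 2], [1, 0, 2, 1]]

# Every undirected link contributes two columns.
num_directed_edges = len(edge_index[0])
print(num_directed_edges)  # 4
```

With this storage, a node's degree equals the number of times it appears in the source row, so dividing the number of columns by the number of nodes gives the average degree.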