# Make Your Own Dataset

This tutorial assumes that you already know [the basics of training a GNN for node classification](1_introduction.ipynb) and [how to create, load, and store a DGL graph](2_dglgraph.ipynb).

Goal of this tutorial:

* Create your own graph dataset for node classification, link prediction, or graph classification.

## `DGLDataset` Object Overview

Your custom graph dataset should subclass the `dgl.data.DGLDataset` class and implement the following methods:

* `__getitem__(self, i)`: retrieve the `i`-th example of the dataset.  An example often contains a single DGL graph, and occasionally its label.
* `__len__(self)`: the number of examples in the dataset.
* `process(self)`: load and process the raw data from disk.
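
To make this contract concrete without requiring DGL, here is a minimal plain-Python sketch.  `ToyDataset` and its edge-list examples are hypothetical stand-ins, not part of DGL; the only point is to show how `process`, `__getitem__`, and `__len__` fit together.

```python
class ToyDataset:
    """Hypothetical stand-in that mimics the shape of a DGLDataset subclass."""

    def __init__(self):
        # With the real base class, constructing the dataset triggers process();
        # this stand-in has no base class, so it calls process() by hand.
        self.process()

    def process(self):
        # Pretend each example is an edge list paired with a label.
        self.examples = [
            ([(0, 1), (1, 2)], 0),
            ([(0, 1), (1, 2), (2, 0)], 1),
        ]

    def __getitem__(self, i):
        # Return the i-th example.
        return self.examples[i]

    def __len__(self):
        # Return the number of examples.
        return len(self.examples)


dataset = ToyDataset()
edges, label = dataset[1]
print(len(dataset), edges, label)
```

Because `__getitem__` raises `IndexError` past the last example, such an object can also be iterated with a plain `for` loop.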

## Creating a Dataset for Node Classification or Link Prediction from CSV

A node classification dataset often consists of a single graph, as well as its node and edge features.

Here we take a small dataset based on [Zachary's Karate Club network](https://en.wikipedia.org/wiki/Zachary%27s_karate_club).  It contains:

* A `members.csv` file containing the attributes of all members.
* An `interactions.csv` file containing the pairwise interactions between club members.

We will treat the members as nodes and interactions as edges.  We will take age as a numeric feature of the nodes, affiliated club as the label of the nodes, and edge weight as a numeric feature of the edges.

<div class="alert alert-info">
    
**Note**: the original Zachary's Karate Club network does not have member ages.  The ages in this tutorial are generated synthetically for demonstrating how to add node features into the graph for dataset creation.
    
</div>

<div class="alert alert-info">
    
**Note**: in practice taking age directly as a numeric feature may not work well in machine learning; strategies like binning or normalizing the feature would work better.  Here we are directly taking the values as-is for simplicity.
    
</div>
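
As a rough illustration of the two strategies mentioned in the note, the sketch below standardizes and bins a handful of ages using only the Python standard library.  The age values and the bin boundaries are made up for this example and are not taken from `members.csv`.

```python
from bisect import bisect_right
from statistics import mean, pstdev

# Hypothetical ages; the real values live in the 'Age' column of members.csv.
ages = [44, 37, 37, 40, 30, 52]

# Standardization: subtract the mean and divide by the (population) standard deviation.
mu = mean(ages)
sigma = pstdev(ages)
normalized = [(age - mu) / sigma for age in ages]

# Binning: map each age to the index of its bucket.
# Assumed buckets: <30 -> 0, 30-39 -> 1, 40-49 -> 2, >=50 -> 3.
bin_edges = [30, 40, 50]
binned = [bisect_right(bin_edges, age) for age in ages]

print(normalized)
print(binned)
```

Either list could then be converted to a tensor and stored as a node feature in place of the raw ages.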

In [39]:
import dgl
from dgl.data import DGLDataset
import pandas as pd
import torch
import os

class KarateClubDataset(DGLDataset):
    def __init__(self):
        super().__init__(name='karate_club')
        
    def process(self):
        # Load the raw node and edge tables.
        nodes_data = pd.read_csv('data/members.csv')
        edges_data = pd.read_csv('data/interactions.csv')
        # Node feature (age) and node label (club, encoded as integer category codes).
        node_features = torch.from_numpy(nodes_data['Age'].to_numpy())
        node_labels = torch.from_numpy(nodes_data['Club'].astype('category').cat.codes.to_numpy())
        # Edge feature (weight) and the two endpoints of every edge.
        edge_features = torch.from_numpy(edges_data['Weight'].to_numpy())
        edges_src = torch.from_numpy(edges_data['Src'].to_numpy())
        edges_dst = torch.from_numpy(edges_data['Dst'].to_numpy())

        # Build the graph.  Passing num_nodes explicitly ensures that any member
        # who appears in no interaction still becomes a node.
        self.graph = dgl.graph((edges_src, edges_dst), num_nodes=nodes_data.shape[0])
        self.graph.ndata['feat'] = node_features
        self.graph.ndata['label'] = node_labels
        self.graph.edata['weight'] = edge_features
        
    def __getitem__(self, i):
        return self.graph
    
    def __len__(self):
        return 1
    
    def save(self):
        dgl.save_graphs(os.path.join(self.save_path, 'karate.dgl'), self.graph)
        
    def load(self):
        (self.graph,), _ = dgl.load_graphs(os.path.join(self.save_path, 'karate.dgl'))

In [40]:
dataset = KarateClubDataset()
graph = dataset[0]


In [41]:
print(graph)

Graph(num_nodes=34, num_edges=156,
      ndata_schemes={'feat': Scheme(shape=(), dtype=torch.int64), 'label': Scheme(shape=(), dtype=torch.int8)}
      edata_schemes={'weight': Scheme(shape=(), dtype=torch.float64)})
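
The `label` field is stored as small integers because the club names are converted with `astype('category').cat.codes`, which numbers the distinct values in sorted order.  The following standard-library sketch mimics that encoding; the club names in the list are illustrative rows, not the contents of `members.csv`.

```python
# Hypothetical 'Club' column values.
clubs = ["Mr. Hi", "Officer", "Mr. Hi", "Officer", "Officer"]

# Distinct categories in sorted order, as pandas does for string categories.
categories = sorted(set(clubs))

# Map every category name to its position in the sorted list.
code_of = {name: code for code, name in enumerate(categories)}
labels = [code_of[club] for club in clubs]

print(categories)  # ['Mr. Hi', 'Officer']
print(labels)      # [0, 1, 0, 1, 1]
```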


Since a link prediction dataset also involves only a single graph, preparing one follows the same steps as preparing a node classification dataset.

## Creating a Dataset for Graph Classification from CSV

Creating a graph classification dataset involves implementing `__getitem__` to return both the graph and its graph-level label.

This tutorial demonstrates how to create a graph classification dataset with the following synthetic CSV data:

* `graph_edges.csv`: containing three columns:
  * `graph_id`: the ID of the graph.
  * `src`: the source node of an edge of the given graph.
  * `dst`: the destination node of an edge of the given graph.
* `graph_properties.csv`: containing three columns:
  * `graph_id`: the ID of the graph.
  * `label`: the label of the graph.
  * `num_nodes`: the number of nodes in the graph.
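
To see how the two tables relate, the sketch below groups a tiny edge table by `graph_id` and joins it with the properties table, using only the standard library.  The CSV contents are invented for illustration and are much smaller than the tutorial's files.

```python
import csv
import io
from collections import defaultdict

# Hypothetical miniature versions of the two CSV files (not the tutorial's data).
edges_csv = io.StringIO(
    "graph_id,src,dst\n"
    "0,0,1\n"
    "0,1,0\n"
    "1,0,2\n"
    "1,2,1\n"
)
properties_csv = io.StringIO(
    "graph_id,label,num_nodes\n"
    "0,0,2\n"
    "1,1,4\n"
)

# graph_id -> (label, num_nodes)
properties = {
    int(row["graph_id"]): (int(row["label"]), int(row["num_nodes"]))
    for row in csv.DictReader(properties_csv)
}

# graph_id -> list of (src, dst) pairs
edges_by_graph = defaultdict(list)
for row in csv.DictReader(edges_csv):
    edges_by_graph[int(row["graph_id"])].append((int(row["src"]), int(row["dst"])))

# graph_id -> (edge count, node count inferred from edges, declared node count, label)
summary = {}
for graph_id, edge_list in sorted(edges_by_graph.items()):
    label, num_nodes = properties[graph_id]
    # Largest node ID seen in the edge list, plus one.
    inferred = max(max(src, dst) for src, dst in edge_list) + 1
    summary[graph_id] = (len(edge_list), inferred, num_nodes, label)
    print(graph_id, summary[graph_id])
```

In this made-up data, graph 1 declares four nodes but its edges only mention nodes 0 to 2, so node 3 is isolated.  The edge list alone cannot reveal such nodes, which is why the `num_nodes` column is stored separately.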

In [42]:
edges = pd.read_csv('data/graph_edges.csv')
properties = pd.read_csv('data/graph_properties.csv')

In [43]:
edges.head()

|   | graph_id | src | dst |
|---|----------|-----|-----|
| 0 | 0        | 0   | 1   |
| 1 | 0        | 0   | 14  |
| 2 | 0        | 1   | 0   |
| 3 | 0        | 1   | 2   |
| 4 | 0        | 2   | 1   |


In [44]:
properties.head()

|   | graph_id | label | num_nodes |
|---|----------|-------|-----------|
| 0 | 0        | 0     | 15        |
| 1 | 1        | 0     | 10        |
| 2 | 2        | 0     | 13        |
| 3 | 3        | 0     | 13        |
| 4 | 4        | 0     | 17        |


In [63]:
class SyntheticDataset(DGLDataset):
    def __init__(self):
        super().__init__(name='synthetic')
        
    def process(self):
        edges = pd.read_csv('data/graph_edges.csv')
        properties = pd.read_csv('data/graph_properties.csv')
        self.graphs = []
        self.labels = []
        
        # Create a graph for each graph ID from the edges table.
        # First process the properties table into two dictionaries with graph IDs as keys.
        # The label and number of nodes are values.
        label_dict = {}
        num_nodes_dict = {}
        for _, row in properties.iterrows():
            label_dict[row['graph_id']] = row['label']
            num_nodes_dict[row['graph_id']] = row['num_nodes']
            
        # For the edges, first group the table by graph IDs.
        edges_group = edges.groupby('graph_id')
        
        # For each graph ID...
        for graph_id in edges_group.groups:
            # Find the edges as well as the number of nodes and its label.
            edges_of_id = edges_group.get_group(graph_id)
            src = edges_of_id['src'].to_numpy()
            dst = edges_of_id['dst'].to_numpy()
            num_nodes = num_nodes_dict[graph_id]
            label = label_dict[graph_id]
            
            # Create a graph and add it to the list of graphs and labels.
            g = dgl.graph((src, dst), num_nodes=num_nodes)
            self.graphs.append(g)
            self.labels.append(label)
            
        # Convert the label list to tensor for saving.
        self.labels = torch.LongTensor(self.labels)
        
    def __getitem__(self, i):
        return self.graphs[i], self.labels[i]
    
    def __len__(self):
        return len(self.graphs)
    
    def save(self):
        dgl.save_graphs(os.path.join(self.save_path, 'synthetic.dgl'), self.graphs, {'label': self.labels})
        
    def load(self):
        self.graphs, labels = dgl.load_graphs(os.path.join(self.save_path, 'synthetic.dgl'))
        self.labels = labels['label']

In [64]:
dataset = SyntheticDataset()
graph, label = dataset[0]
print(graph, label)

Graph(num_nodes=15, num_edges=45,
      ndata_schemes={}
      edata_schemes={}) tensor(0)
