# Make Your Own Dataset

This tutorial assumes that you already know [the basics of training a GNN for node classification](1_introduction.ipynb) and [how to create, load, and store a DGL graph](2_dglgraph.ipynb).

Goal of this tutorial:

* Create your own graph dataset for node classification, link prediction, or graph classification.

## `DGLDataset` Object Overview

Your custom graph dataset should subclass `dgl.data.DGLDataset` and implement the following methods:

* `__getitem__(self, i)`: retrieve the `i`-th example of the dataset. An example often contains a single DGL graph and, occasionally, its label.
* `__len__(self)`: return the number of examples in the dataset.
* `process(self)`: load and process the raw data from disk.

## Creating a Dataset for Node Classification from CSV

A node classification dataset often consists of a single graph, as well as its node and edge features.

Here we take a small dataset based on [Zachary's Karate Club network](https://en.wikipedia.org/wiki/Zachary%27s_karate_club). It contains:

* A `members.csv` file containing the attributes of all club members.
* An `interactions.csv` file containing the pairwise interactions between club members.

We will treat the members as nodes and interactions as edges.  We will take age as a numeric feature of the nodes, affiliated club as the label of the nodes, and edge weight as a numeric feature of the edges.
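
To make the mapping concrete, here is a small sketch with made-up rows standing in for the two files. The column names `Club`, `Age`, `Src`, `Dst`, and `Weight` match those read by the dataset code in this tutorial. The `Id` column and all values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical rows; the real files hold one row per member / interaction.
members = pd.DataFrame({
    'Id': [0, 1, 2],
    'Club': ['Mr. Hi', 'Officer', 'Mr. Hi'],
    'Age': [44, 37, 37],
})
interactions = pd.DataFrame({
    'Src': [0, 0, 1],
    'Dst': [1, 2, 2],
    'Weight': [0.3, 0.8, 0.5],
})

# String labels become integer class ids; categories are sorted alphabetically.
labels = members['Club'].astype('category').cat.codes.to_numpy()
print(labels)

# Every edge endpoint must refer to an existing node id.
assert interactions[['Src', 'Dst']].to_numpy().max() < len(members)
```

Each row of `members` becomes a node and each row of `interactions` becomes an edge from `Src` to `Dst`.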

<div class="alert alert-info">
    
**Note**: the original Zachary's Karate Club network does not have member ages.  The ages in this tutorial are generated synthetically for demonstrating how to add node features into the graph for dataset creation.
    
</div>

<div class="alert alert-info">
    
**Note**: in practice taking age directly as a numeric feature may not work well in machine learning; strategies like binning or normalizing the feature would work better.  Here we are directly taking the values as-is for simplicity.
    
</div>
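
For reference, the two strategies mentioned in the note could look roughly like the following sketch. The ages and bin edges are arbitrary assumptions, and this tutorial does not apply either transformation.

```python
import pandas as pd

ages = pd.Series([23, 31, 37, 44, 58], name='Age')  # hypothetical values

# Z-score normalization: zero mean and unit (population) standard deviation.
age_norm = (ages - ages.mean()) / ages.std(ddof=0)

# Binning: map each age to an integer bucket id; bin edges chosen arbitrarily.
# Intervals are right-inclusive: (0, 30], (30, 40], (40, 50], (50, 120].
age_bin = pd.cut(ages, bins=[0, 30, 40, 50, 120], labels=False)

print(age_norm.round(2).tolist())
print(age_bin.tolist())
```

Either result can be converted with `to_numpy()` and stored as a node feature in place of the raw ages.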

In [13]:
import dgl
from dgl.data import DGLDataset
import pandas as pd
import torch
import os

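class KarateClubDataset(DGLDataset):
    def __init__(self):
        # The base constructor calls load() if has_cache() is true,
        # otherwise process() followed by save().
        super().__init__(name='karate_club')

    def process(self):
        # Read the raw tables: one row per member and one row per interaction.
        nodes_data = pd.read_csv('data/members.csv')
        edges_data = pd.read_csv('data/interactions.csv')
        node_features = torch.from_numpy(nodes_data['Age'].to_numpy())
        # Encode the club names as integer class ids.
        node_labels = torch.from_numpy(nodes_data['Club'].astype('category').cat.codes.to_numpy())
        edge_features = torch.from_numpy(edges_data['Weight'].to_numpy())
        edges_src = torch.from_numpy(edges_data['Src'].to_numpy())
        edges_dst = torch.from_numpy(edges_data['Dst'].to_numpy())

        # Pass num_nodes explicitly so that members without interactions are kept.
        self.graph = dgl.graph((edges_src, edges_dst), num_nodes=nodes_data.shape[0])
        self.graph.ndata['feat'] = node_features
        self.graph.ndata['label'] = node_labels
        self.graph.edata['weight'] = edge_features

    def __getitem__(self, i):
        # The dataset holds a single graph, so the index is ignored.
        return self.graph

    def __len__(self):
        return 1

    def has_cache(self):
        # Without this override the base class never calls load().
        return os.path.exists(os.path.join(self.save_path, 'karate.dgl'))

    def save(self):
        dgl.save_graphs(os.path.join(self.save_path, 'karate.dgl'), self.graph)

    def load(self):
        # load_graphs returns a list of graphs and a dictionary of labels.
        (self.graph,), _ = dgl.load_graphs(os.path.join(self.save_path, 'karate.dgl'))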

In [14]:
dataset = KarateClubDataset()
graph = dataset[0]

## Creating a Dataset for Link Prediction from CSV

## Creating a Dataset for Graph Classification from CSV