From now on, many exercises will require longer training times. To shorten them, start getting familiar with Google Colab, which lets you run your scripts on a Google GPU (in a Colab notebook, go to Runtime -> Change runtime type -> select GPU).
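
After switching the runtime, it can help to confirm that PyTorch actually sees the GPU. The following is a minimal sketch; the variable names `device` and `x` are only illustrative.

```python
import torch

# Pick the GPU when one is visible to PyTorch, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Running on:", device)

# Tensors (and models) are moved to the chosen device with .to(device).
x = torch.ones(3, 2).to(device)
print(x.device.type)
```

If this prints `cpu` on Colab, the runtime type has most likely not been changed yet.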

In [None]:
# Uncomment these lines if on colab
#!pip install dgl-cu100
#!pip install --upgrade tables

In [1]:
import dgl
import pandas as pd
import torch
import numpy as np
import matplotlib.pyplot as plt
import networkx as nx
import torch.nn as nn
import torch.optim as optim
from tqdm.notebook import tqdm
import ast


## Exercise 3, part 1

Goals of this assignment:

1. Basic introduction to the DGL library https://www.dgl.ai/
2. Train a classifier that takes a point cloud as input (you must achieve validation accuracy over 85%)
3. Learn to work with DGL graphs, node data, and batching
4. Build a DeepSets architecture with the DGL mean_nodes and broadcast_nodes functions

First, download the dataset. It is a modified version of the MNIST dataset in which the images have been converted to point clouds.

<b> The task is to classify each graph and say which number it represents. </b>
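
To get a feeling for what "image converted to a point cloud" can mean, here is a small sketch with NumPy. It is an assumption, not the procedure used to build this dataset: the toy 5x5 image, the 0.5 threshold, and the rescaling to [-1, 1] are all made up for illustration.

```python
import numpy as np

# Toy 5x5 "image": non-zero pixels are the ink of the digit (made-up data).
image = np.array([
    [0, 0, 1, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 1, 1, 1, 0],
], dtype=float)

# Row/column indices of every pixel above an (assumed) intensity threshold.
rows, cols = np.nonzero(image > 0.5)

# Rescale the indices to [-1, 1]; rows are flipped so that y points upwards.
height, width = image.shape
x = 2.0 * cols / (width - 1) - 1.0
y = 1.0 - 2.0 * rows / (height - 1)

points = np.stack([x, y], axis=1)  # one (x, y) row per ink pixel
print(points.shape)
```

Each image ends up with a different number of points, which is why the number of nodes varies from graph to graph.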

In [None]:
!unzip ../../Datasets/Dataset_MNIST.zip

In [None]:
%load_ext autoreload
%autoreload 2

### DataSet

The dataset class is already provided, but it is worth looking at how a DGL graph is created. The graphs:

* have no edges, they are just a collection of nodes;

* their nodes have a feature which is named "xy", which represents the position of the node in 2D space.

The dataset will return a graph and a target class (from 0 to 9).

In [None]:
training_df = pd.read_hdf('Dataset/training_ds.h5')

In [None]:
training_df.head()

In [None]:
labels = torch.LongTensor(training_df.label)
n_points = training_df.n_points.values

In [None]:
labels

In [None]:
n_points

Now import the dataset class and the collate function that have already been written for you.

In [None]:
from dataloader import PointCloudMNISTdataset, collate_graphs

In [None]:
training_dataset = PointCloudMNISTdataset('Dataset/training_ds.h5')
validation_dataset = PointCloudMNISTdataset('Dataset/valid_ds.h5')

In [None]:
g, y = training_dataset[663]

In [None]:
# Graph 663 has 93 nodes, no edges and each node is associated with a property 'xy'

g, y

In [None]:
#To see the 2D array of coordinates
#g.ndata['xy']

In [None]:
fig,ax = plt.subplots(figsize=(4,4))

xy = g.ndata['xy'].data.numpy()

ax.scatter( xy[:,0],xy[:,1] )

ax.set_ylabel('Y ',fontsize=20,rotation=0)
ax.set_xlabel('X',fontsize=20)
ax.set_xlim(-1,1)
ax.set_ylim(-1,1)
plt.show()

###  How to batch?

We need to batch our data in a special way, so we have to tell the PyTorch DataLoader how to do it. We do this with the collate_graphs function defined in dataloader.py, which calls dgl.batch( ). The batched graph includes all the nodes from all the graphs, and DGL keeps track of which nodes belong to which graph.

In [None]:
from torch.utils.data import Dataset, DataLoader

data_loader = DataLoader(training_dataset, batch_size=300, shuffle=True,
                         collate_fn=collate_graphs)
validation_data_loader = DataLoader(validation_dataset, batch_size=300, shuffle=False,
                         collate_fn=collate_graphs)

In [None]:
batched_g,y = next(iter(data_loader)) # grab a single batch

In [None]:
batched_g,y

In [None]:
batched_g.batch_num_nodes()
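
To make the bookkeeping behind a batched graph concrete, here is a DGL-free sketch in plain PyTorch. It only imitates the idea (all nodes stacked together, plus a record of how many nodes each graph has); it is not how dgl.batch( ) is implemented, and the names `clouds`, `all_nodes`, `num_nodes` and `graph_id` are illustrative.

```python
import torch

# Three toy point clouds with 4, 2 and 3 points, each point being (x, y).
clouds = [torch.randn(4, 2), torch.randn(2, 2), torch.randn(3, 2)]

# "Batching": stack all nodes into one tensor and remember the sizes.
all_nodes = torch.cat(clouds, dim=0)                      # shape (9, 2)
num_nodes = torch.tensor([c.shape[0] for c in clouds])    # nodes per cloud

# graph_id[i] says which cloud node i came from.
graph_id = torch.repeat_interleave(torch.arange(len(clouds)), num_nodes)
print(all_nodes.shape, num_nodes.tolist(), graph_id.tolist())
```

The `num_nodes` tensor plays the same role as the output of batched_g.batch_num_nodes() in the cell above.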

## The model: DeepSets

### Explanation of the structure

A possible model is DeepSets (feel free to implement this or change it). 

<img src="deepset.jpeg" width="800" height="400">

In [None]:
batched_g,y = next(iter(data_loader)) # grab a single batch

In [None]:
type(batched_g.ndata['xy']), batched_g.ndata['xy'].shape

The input is an array of N points, each with a feature vector (here xy, so 2 dimensions). We need to apply the same network to every node in the graph, which we do by applying a linear layer to the node features.

In [None]:
# Example of creation of the embedding

linear_layer = nn.Linear(2,10)
# You store the output on the graph itself
batched_g.ndata['hidden rep'] = linear_layer(batched_g.ndata['xy'])

In [None]:
batched_g.ndata['hidden rep'].shape

Next, we need to take the mean of the hidden representations in each graph. DGL has a function for that, dgl.mean_nodes( ), which knows that our graph is a batch of different graphs.

In [None]:
# Average the 'hidden rep' of the nodes separately for each graph in the batch
mean_of_node_rep = dgl.mean_nodes(batched_g,'hidden rep')
mean_of_node_rep.shape
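
If you want to see what a per-graph mean amounts to, the following plain PyTorch sketch reproduces the idea without DGL. It assumes a batch of three graphs with 4, 2 and 3 nodes; the tensors are random stand-ins, not the real node data.

```python
import torch

num_nodes = torch.tensor([4, 2, 3])               # nodes per graph in the batch
hidden = torch.randn(int(num_nodes.sum()), 10)    # one 10-d vector per node

# Map every node to the index of the graph it belongs to.
graph_id = torch.repeat_interleave(torch.arange(len(num_nodes)), num_nodes)

# Sum the node vectors per graph, then divide by the node count.
sums = torch.zeros(len(num_nodes), hidden.shape[1]).index_add_(0, graph_id, hidden)
means = sums / num_nodes.unsqueeze(1)
print(means.shape)  # one row per graph, one column per hidden feature
```

The result has one row per graph, which matches the shape you get from dgl.mean_nodes( ) on the batched graph.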

We need to be able to "broadcast" this global mean back to each of the individual nodes, so that they are "aware" of the rest of the graph.

In [None]:
broadcasted_mean = dgl.broadcast_nodes(batched_g,mean_of_node_rep)

In [None]:
broadcasted_mean.shape

We assign this broadcasted global rep as a feature of the nodes.

In [None]:
batched_g.ndata['global rep'] = broadcasted_mean
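
Broadcasting simply repeats each graph-level vector once for every node of that graph. A DGL-free sketch of the same idea, with made-up numbers so that the repetition is easy to see:

```python
import torch

num_nodes = torch.tensor([4, 2, 3])     # nodes per graph
per_graph = torch.tensor([[1.0, 10.0],  # one row per graph
                          [2.0, 20.0],
                          [3.0, 30.0]])

# Repeat row i of per_graph num_nodes[i] times, in graph order.
per_node = torch.repeat_interleave(per_graph, num_nodes, dim=0)
print(per_node.shape)
print(per_node[:, 0].tolist())
```

The output has one row per node, so it can be stored next to the other node features.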

Now we can use it as input to a new linear layer and update the hidden rep of each node. After this step, the hidden rep of each node contains information from the entire graph.

In [None]:
linear_layer2 = nn.Linear(10*2,10)

input_to_layer = torch.cat([
                            batched_g.ndata['global rep'], 
                            batched_g.ndata['hidden rep']],dim=1)

batched_g.ndata['hidden rep'] = linear_layer2(input_to_layer)

In [None]:
batched_g.ndata['hidden rep'].shape
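
The three steps above (mean, broadcast, concatenate and apply a linear layer) can be wrapped into one reusable module. The sketch below does this in plain PyTorch, taking the node features and the number of nodes per graph instead of a DGL graph. The class name `DeepSetLayer` is hypothetical, and the ReLU is an addition that the cells above do not have.

```python
import torch
import torch.nn as nn


class DeepSetLayer(nn.Module):
    """One update step: each node sees its own features plus its graph's mean."""

    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(2 * in_dim, out_dim)

    def forward(self, hidden, num_nodes):
        n_graphs = len(num_nodes)
        # Index of the graph each node belongs to.
        graph_id = torch.repeat_interleave(torch.arange(n_graphs), num_nodes)
        # Per-graph mean of the node vectors.
        sums = torch.zeros(n_graphs, hidden.shape[1]).index_add(0, graph_id, hidden)
        mean = sums / num_nodes.unsqueeze(1)
        # Broadcast the mean back to the nodes and combine it with the local rep.
        global_rep = mean[graph_id]
        combined = torch.cat([global_rep, hidden], dim=1)
        return torch.relu(self.linear(combined))  # ReLU is an added assumption


layer = DeepSetLayer(10, 10)
num_nodes = torch.tensor([4, 2, 3])
out = layer(torch.randn(9, 10), num_nodes)
print(out.shape)
```

Because the mean is computed per graph, changing a node in one graph leaves the outputs of all other graphs untouched. This sketch runs on the CPU only.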

### What model should you build?

The final model should take a graph as input and return a vector of length 10 (remember, this is MNIST and our task is to classify digits). This model worked for me, but feel free to do whatever you like.

<img src="model_example.jpeg" width="800" height="400">

In [None]:
from model import Net

In [None]:
net = Net()
net

In [None]:
batched_g,y = next(iter(data_loader)) # grab a single batch

In [None]:
net(batched_g).shape
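
The contract to aim for is: a batch of graphs goes in, one row of 10 scores per graph comes out. The sketch below illustrates that contract with a DGL-free stand-in. It is not the Net from model.py; the name `TinyDeepSet`, the hidden size of 32, and the single update step are assumptions, and there is no claim that it reaches the 85% threshold.

```python
import torch
import torch.nn as nn


def graph_mean(hidden, num_nodes):
    """Return the per-graph mean of the node vectors and the node -> graph index."""
    graph_id = torch.repeat_interleave(torch.arange(len(num_nodes)), num_nodes)
    sums = torch.zeros(len(num_nodes), hidden.shape[1]).index_add(0, graph_id, hidden)
    return sums / num_nodes.unsqueeze(1), graph_id


class TinyDeepSet(nn.Module):  # hypothetical stand-in, CPU only
    def __init__(self, hidden_dim=32, n_classes=10):
        super().__init__()
        self.embed = nn.Sequential(nn.Linear(2, hidden_dim), nn.ReLU())
        self.update = nn.Sequential(nn.Linear(2 * hidden_dim, hidden_dim), nn.ReLU())
        self.classify = nn.Linear(hidden_dim, n_classes)

    def forward(self, xy, num_nodes):
        h = self.embed(xy)                                   # per-node embedding
        mean, graph_id = graph_mean(h, num_nodes)
        h = self.update(torch.cat([mean[graph_id], h], dim=1))
        pooled, _ = graph_mean(h, num_nodes)                 # one vector per graph
        return self.classify(pooled)                         # one row of scores per graph


tiny_net = TinyDeepSet()
num_nodes = torch.tensor([4, 2, 3])
logits = tiny_net(torch.rand(9, 2) * 2 - 1, num_nodes)
print(logits.shape)
```

Since every step is either applied per node or is a mean over nodes, the scores do not depend on the order in which the points of a graph are listed.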

## Training and testing the model

Remember, the threshold is 85%!

Nothing changes here (the 'CUDA' parts allow you to use the GPU on Colab).

In [None]:
loss_func = nn.CrossEntropyLoss()
optimizer = optim.Adam(net.parameters(), lr=0.00005) 
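
As a reminder of what nn.CrossEntropyLoss expects: raw scores (logits) of shape (batch, classes) and integer class labels. A quick sanity check with made-up inputs: when all 10 scores are equal, the loss is ln(10).

```python
import math
import torch
import torch.nn as nn

loss_func = nn.CrossEntropyLoss()

# 3 examples, 10 classes: all-zero logits mean "no preference".
logits = torch.zeros(3, 10)
targets = torch.tensor([0, 4, 9])   # class indices, dtype int64

loss = loss_func(logits, targets)
print(loss.item(), math.log(10))    # both are ln(10) for a uniform prediction
```

A freshly initialised classifier should therefore start with a loss somewhere near 2.3; a much larger value at the start usually points to a problem in the model or the labels.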

In [None]:
def compute_accuracy_and_loss(dataloader,net):
    total = 0
    correct = 0
    
    loss = 0
    
    if torch.cuda.is_available():
        net.cuda()
    net.eval()
    
    n_batches = 0
    with torch.no_grad():
        for batched_g,y in dataloader:
            n_batches+=1
            
            if torch.cuda.is_available():
                batched_g = batched_g.to(torch.device('cuda'))
                y = y.cuda()
            pred = net(batched_g)
            
            loss+= loss_func(pred,y).item()
            
            pred = torch.argmax(pred,dim=1)

            correct+=(pred==y).sum().item()
            total+=len(y)
    loss = loss/n_batches      
    return correct/total, loss
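
The accuracy part of this function boils down to comparing the index of the largest score with the true label. A tiny worked example with made-up scores for 4 examples and 3 classes:

```python
import torch

logits = torch.tensor([[2.0, 0.1, 0.3],
                       [0.2, 0.1, 3.0],
                       [0.5, 1.5, 0.1],
                       [1.0, 0.2, 0.1]])
y = torch.tensor([0, 2, 0, 0])

pred = logits.argmax(dim=1)           # predicted class per example: [0, 2, 1, 0]
correct = (pred == y).sum().item()    # 3 of the 4 predictions match
accuracy = correct / len(y)           # 0.75
print(pred.tolist(), correct, accuracy)
```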

In [None]:
if torch.cuda.is_available():
    net.cuda()

In [None]:
# Run this on Colab:
# upload the .py files first, then download the trained_model.pt file that this cell produces

if torch.cuda.is_available():

    n_epochs = 30

    training_loss_vs_epoch = []
    validation_loss_vs_epoch = []

    training_acc_vs_epoch = []
    validation_acc_vs_epoch = []

    pbar = tqdm( range(n_epochs) )

    for epoch in pbar:

        if len(validation_acc_vs_epoch) > 0:
            pbar.set_description('val acc:'+'{0:.5f}'.format(validation_acc_vs_epoch[-1])+', train acc:'+'{0:.5f}'.format(training_acc_vs_epoch[-1]))

        net.train() # put the net into "training mode"
        for batched_g,y in data_loader:
            if torch.cuda.is_available():
                batched_g = batched_g.to(torch.device('cuda'))
                y = y.cuda()

            optimizer.zero_grad()
            pred = net(batched_g)
            loss = loss_func(pred,y)
            loss.backward()
            optimizer.step()

        net.eval() #put the net into evaluation mode
        train_acc, train_loss = compute_accuracy_and_loss(data_loader,net)
        valid_acc, valid_loss = compute_accuracy_and_loss(validation_data_loader,net)

        training_loss_vs_epoch.append(train_loss)    
        training_acc_vs_epoch.append(train_acc)

        validation_acc_vs_epoch.append(valid_acc)

        validation_loss_vs_epoch.append(valid_loss)
        # Save only when this epoch has the lowest validation loss so far
        if valid_loss <= min(validation_loss_vs_epoch):
            torch.save(net.state_dict(), 'trained_model.pt')
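
Checkpointing on the validation loss is usually meant to keep the best model seen so far. Here is a minimal sketch of that bookkeeping in plain Python; the loss values are made up, and `saved_epochs` just records where a torch.save call would happen.

```python
# Hypothetical validation losses, one per epoch.
validation_losses = [0.90, 0.70, 0.75, 0.72, 0.60]

best_loss = float("inf")
saved_epochs = []
for epoch, valid_loss in enumerate(validation_losses):
    if valid_loss < best_loss:
        best_loss = valid_loss
        saved_epochs.append(epoch)   # torch.save(...) would go here

print(saved_epochs)  # [0, 1, 4]
```

Note that comparing only against the previous epoch would also save at epoch 3 here (0.72 is lower than 0.75), overwriting the better checkpoint from epoch 1.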

In [None]:
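if torch.cuda.is_available():
    
    fig,ax = plt.subplots(1,2,figsize=(8,3))

    ax[0].plot(training_loss_vs_epoch,label='training')
    ax[0].plot(validation_loss_vs_epoch,label='validation')
    ax[0].set_xlabel('epoch')
    ax[0].set_ylabel('loss')
    ax[0].legend() # the labels above are only shown once legend() is called

    ax[1].plot(training_acc_vs_epoch)
    ax[1].plot(validation_acc_vs_epoch)
    ax[1].set_xlabel('epoch')
    ax[1].set_ylabel('accuracy')

    plt.show()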

In [None]:
net.load_state_dict(torch.load('trained_model.pt',map_location='cpu'))
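
The map_location='cpu' argument is what lets you load weights that were saved on the Colab GPU onto a machine without one. The save and load round trip can be tried out on a toy layer; in this sketch an in-memory buffer stands in for the trained_model.pt file, and the nn.Linear layer is only a placeholder for the real network.

```python
import io
import torch
import torch.nn as nn

model = nn.Linear(2, 10)

# Save only the parameters (the state dict); a BytesIO buffer stands in for a file.
buffer = io.BytesIO()
torch.save(model.state_dict(), buffer)
buffer.seek(0)

# Rebuild the same architecture, then load the weights onto the CPU.
restored = nn.Linear(2, 10)
restored.load_state_dict(torch.load(buffer, map_location="cpu"))
print(torch.equal(model.weight, restored.weight))
```

The architecture has to be created first because the state dict contains only the parameter values, not the model definition.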

In [None]:
from evaluate import evaluate_on_dataset

In [None]:
evaluate_on_dataset('Dataset/valid_ds.h5')