# Challenge - Graph Convolutional Networks

![](https://images.unsplash.com/photo-1503365070998-37e56a2606e2?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=1050&q=80)

In the exercise below, you will experiment with the official implementation of the Graph Convolutional Network (GCN). The code has been adapted so that you can run it directly in a Jupyter notebook rather than from the command line.

In this example, you will:
- load a graph
- try to make predictions through classical MLP
- try to make predictions through GCNs 
- observe the results
- read the code of the implementation
- fine-tune the parameters
- repeat it for other networks

!! TensorFlow 2.0 is required for this exercise !!

In [None]:
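# sys, pkl and np are used in later cells; import them explicitly
# rather than relying on the star import from gcn.utils
import sys
import time
import pickle as pkl
import numpy as np
import tensorflow.compat.v1 as tf
import matplotlib.pyplot as plt
from gcn.utils import *
from gcn.models import GCN, MLP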

tf.disable_eager_execution()

In [None]:
# Set random seed
seed = 123
np.random.seed(seed)
tf.set_random_seed(seed)

In [None]:
# Constants
DATASET = "cora" # Dataset string.
LR = 0.01 # Initial learning rate.
HIDDEN_1 = 16 # Number of units in hidden layer 1.
EPOCHS = 200 # Number of epochs to train.
DROPOUT = 0.5 # Dropout rate (1 - keep probability).
WEIGHT_DECAY = 5e-4 # Weight for L2 loss on embedding matrix.
EARLY_STOPPING = 10 # Tolerance for early stopping (# of epochs)
MAX_DEGREE = 3 # Maximum Chebyshev polynomial degree.

# I. Introduction

**Q1** : The code below will load the graph, which has previously been saved as a Pickle object. Take the `graph` variable and plot it using NetworkX.

The file extensions correspond to pre-processed parts of the dataset:
- x => the feature vectors of the training instances as scipy.sparse.csr.csr_matrix object;
- tx => the feature vectors of the test instances as scipy.sparse.csr.csr_matrix object;
- allx => the feature vectors of both labeled and unlabeled training instances (a superset of ind.dataset_str.x) as scipy.sparse.csr.csr_matrix object;
- y => the one-hot labels of the labeled training instances as numpy.ndarray object;
- ty => the one-hot labels of the test instances as numpy.ndarray object;
- ally => the labels for instances in ind.dataset_str.allx as numpy.ndarray object;
- graph => a dict in the format {index: [index_of_neighbor_nodes]} as collections.defaultdict object;
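
To make the `graph` format concrete, here is a minimal sketch that turns a small, made-up `{index: [index_of_neighbor_nodes]}` dict into a dense adjacency matrix with NumPy. The toy graph and the helper name `adjacency_from_dict` are illustrative assumptions and are not part of the `gcn` package. The sketch also assumes that nodes are numbered from 0 and that edges are treated as undirected.

```python
import numpy as np

def adjacency_from_dict(graph):
    """Build a dense, symmetric 0/1 adjacency matrix from {node: [neighbors]}."""
    n = len(graph)
    adj = np.zeros((n, n), dtype=np.int64)
    for node, neighbors in graph.items():
        for neighbor in neighbors:
            adj[node, neighbor] = 1
            adj[neighbor, node] = 1  # assumption: treat edges as undirected
    return adj

# Hypothetical 4-node graph (not the Cora data)
toy_graph = {0: [1, 2], 1: [0], 2: [0, 3], 3: [2]}
toy_adj = adjacency_from_dict(toy_graph)
print(toy_adj)
print("degrees:", toy_adj.sum(axis=1))
```

A dense matrix is only practical for small graphs; for a graph the size of Cora a sparse representation is the more usual choice.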

In [None]:
names = ['x', 'y', 'tx', 'ty', 'allx', 'ally', 'graph']
objects = []
for i in range(len(names)):
    with open("gcn/data/ind.cora.{}".format(names[i]), 'rb') as f:
        if sys.version_info > (3, 0):
            objects.append(pkl.load(f, encoding='latin1'))
        else:
            objects.append(pkl.load(f))

x, y, tx, ty, allx, ally, graph = tuple(objects)
test_idx_reorder = parse_index_file("gcn/data/ind.cora.test.index")
test_idx_range = np.sort(test_idx_reorder)

Plot the object `graph` using networkx.

In [None]:
# TODO : plot the graph with networkx

**Q2** : To learn more about the dataset, you can check this link: https://relational.fit.cvut.cz/dataset/CORA. We will be using the CORA citation network. The network is directed. Nodes represent scientific papers. An edge between two nodes indicates that the left node cites the right node.

**Q3**: What is the shape of the training feature vector? And of the test feature vector? What do the features represent? Comment on the ratio of training and test data.

In [None]:
# TODO : what is the shape of `x` ?

In [None]:
# TODO : what is the shape of `tx` ?

There are 140 training instances and 1000 test instances. Each feature is a binary indicator (1 or 0) of whether a given vocabulary word is present in the paper. Since there are about 7 times more test instances than training instances, this is a typical semi-supervised learning problem.
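
As a small illustration of such binary word features, the sketch below encodes one made-up paper against a made-up vocabulary, then computes the test-to-training ratio from the two counts given above. The vocabulary and the paper's words are hypothetical and are not taken from Cora.

```python
# Hypothetical vocabulary and paper content (not taken from Cora)
vocabulary = ["graph", "neural", "network", "genetic", "bayesian"]
paper_words = {"neural", "network", "training"}

# 1 if the vocabulary word occurs in the paper, 0 otherwise
feature_vector = [1 if word in paper_words else 0 for word in vocabulary]
print(feature_vector)

# Ratio of test to training instances
n_train, n_test = 140, 1000
print(round(n_test / n_train, 2))
```

Words outside the vocabulary (here "training") simply do not appear in the feature vector.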

**Q4**: What is the shape of the labels?

In [None]:
# TODO : What is the shape of `y` ?
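
Since the labels are stored one-hot, each row has one column per class. The sketch below uses a made-up set of 4 nodes and 3 classes (toy values, not the Cora labels) to show how class indices map to one-hot rows and back.

```python
import numpy as np

# Hypothetical class indices for 4 nodes and 3 classes (toy values)
class_ids = np.array([2, 0, 1, 2])
num_classes = 3

# One-hot encoding: row i has a single 1 in column class_ids[i]
one_hot = np.eye(num_classes, dtype=np.int64)[class_ids]
print(one_hot.shape)

# argmax along axis 1 recovers the class index of each row
recovered = one_hot.argmax(axis=1)
print(recovered)
```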

# II. Multi-layer Perceptron approach

**Q5**: You will now use the pre-built functions for the Multi-layer Perceptron. Read the code below, and check the source code in the `gcn` folder. Feel free to annotate the code and to change the basic parameters defined above. The code below will load the data, process it, define the model and the evaluation function, and initialize the session.

In [None]:
# Load data
adj, features, y_train, y_val, y_test, train_mask, val_mask, test_mask = load_data(DATASET)

In [None]:
features = preprocess_features(features)
support = [preprocess_adj(adj)]
num_supports = 1
model_func = MLP
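
The later cells use `features[2]` as a shape, which suggests that `preprocess_features` returns a tuple rather than a matrix. The sketch below shows what such a step could look like, assuming it row-normalizes the features and returns a `(coords, values, shape)` tuple. The helper name `row_normalize_to_tuple` and the toy matrix are hypothetical; check `gcn/utils.py` for the actual implementation.

```python
import numpy as np

def row_normalize_to_tuple(dense):
    """Row-normalize a dense matrix and return (coords, values, shape)."""
    row_sums = dense.sum(axis=1, keepdims=True)
    # Avoid division by zero for all-zero rows
    safe_sums = np.where(row_sums == 0, 1.0, row_sums)
    normalized = dense / safe_sums
    coords = np.argwhere(normalized != 0)  # (row, col) of non-zero entries
    values = normalized[coords[:, 0], coords[:, 1]]
    return coords, values, normalized.shape

# Hypothetical 3x3 binary feature matrix with one empty row
toy_features = np.array([[1.0, 0.0, 1.0],
                         [0.0, 0.0, 0.0],
                         [0.0, 1.0, 0.0]])
coords, values, shape = row_normalize_to_tuple(toy_features)
print(coords, values, shape)
```

With this layout, the third element of the tuple is the matrix shape, and its second entry is the number of input features.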

In [None]:
# Define placeholders
placeholders = {
    'support': [tf.sparse_placeholder(tf.float32) for _ in range(num_supports)],
    'features': tf.sparse_placeholder(tf.float32, shape=tf.constant(features[2], dtype=tf.int64)),
    'labels': tf.placeholder(tf.float32, shape=(None, y_train.shape[1])),
    'labels_mask': tf.placeholder(tf.int32),
    'dropout': tf.placeholder_with_default(0., shape=()),
    'num_features_nonzero': tf.placeholder(tf.int32)  # helper variable for sparse dropout
}

# Create model
model = model_func(placeholders, LR, HIDDEN_1, WEIGHT_DECAY, input_dim=features[2][1], logging=True)

# Initialize session
sess = tf.Session()

# Define model evaluation function
def evaluate(features, support, labels, mask, placeholders):
    t_test = time.time()
    feed_dict_val = construct_feed_dict(features, support, labels, mask, placeholders)
    outs_val = sess.run([model.loss, model.accuracy], feed_dict=feed_dict_val)
    return outs_val[0], outs_val[1], (time.time() - t_test)

# Init variables
sess.run(tf.global_variables_initializer())

cost_val = []
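
The `labels_mask` placeholder indicates that loss and accuracy are computed over a subset of the nodes only (training, validation or test). The NumPy sketch below illustrates the idea of a masked accuracy. The function and the toy arrays are hypothetical, and the `gcn` package may compute the metric differently.

```python
import numpy as np

def masked_accuracy(preds, labels, mask):
    """Accuracy over the rows where mask is True; other rows are ignored."""
    correct = preds.argmax(axis=1) == labels.argmax(axis=1)
    mask = np.asarray(mask, dtype=bool)
    return correct[mask].mean()

# Hypothetical predictions for 4 nodes and 2 classes
toy_preds = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.3, 0.7]])
# The last row has no label and is excluded by the mask
toy_labels = np.array([[1, 0], [1, 0], [1, 0], [0, 0]])
toy_mask = np.array([True, True, True, False])

toy_acc = masked_accuracy(toy_preds, toy_labels, toy_mask)
print(toy_acc)
```

Two of the three masked-in nodes are predicted correctly, so the unlabeled node has no influence on the result.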

**Q6**: The code below will run the model training. Try to understand it. Then, plot the results stored in the lists `plot_acc` and `plot_val_acc`.

In [None]:
plot_val_acc = []
plot_acc = []

for epoch in range(EPOCHS):

    t = time.time()
    # Construct feed dictionary
    feed_dict = construct_feed_dict(features, support, y_train, train_mask, placeholders)
    feed_dict.update({placeholders['dropout']: DROPOUT})

    # Training step
    outs = sess.run([model.opt_op, model.loss, model.accuracy], feed_dict=feed_dict)

    # Validation
    cost, acc, duration = evaluate(features, support, y_val, val_mask, placeholders)
    cost_val.append(cost)
    
    plot_val_acc.append(acc)
    plot_acc.append(outs[2])
    
    # Print results
    print("Epoch:", '%04d' % (epoch + 1), "train_loss=", "{:.5f}".format(outs[1]),
          "train_acc=", "{:.5f}".format(outs[2]), "val_loss=", "{:.5f}".format(cost),
          "val_acc=", "{:.5f}".format(acc), "time=", "{:.5f}".format(time.time() - t))

    if epoch > EARLY_STOPPING and cost_val[-1] > np.mean(cost_val[-(EARLY_STOPPING+1):-1]):
        print("Early stopping...")
        break
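
The early stopping test in the loop above can be isolated as a small function: training stops once the latest validation loss is higher than the mean of the previous `EARLY_STOPPING` validation losses. The function name `should_stop` and the loss values are hypothetical and only serve to illustrate the condition.

```python
def should_stop(epoch, cost_val, patience):
    """True when the latest validation loss exceeds the mean of the previous `patience` losses."""
    if epoch <= patience:
        return False
    recent = cost_val[-(patience + 1):-1]
    return cost_val[-1] > sum(recent) / len(recent)

# Hypothetical validation losses: decreasing, then rising at the last epoch
losses = [1.0, 0.8, 0.6, 0.5, 0.45, 0.6]

# epoch is 0-indexed, so after epoch 5 the list holds 6 losses
print(should_stop(epoch=5, cost_val=losses, patience=3))
print(should_stop(epoch=4, cost_val=losses[:5], patience=3))
```

A larger patience makes the criterion less sensitive to a single noisy validation loss.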


In [None]:
# TODO : plot the accuracy on the training set and on the validation set

**Q7**: What do you notice?

The model overfits heavily after only 6-8 epochs. Overall, its node classification performance on unseen nodes is quite limited.

**Q8**: The code below evaluates the MLP and reports its performance on the test set (`y_test`) that was set aside.

In [None]:
test_cost, test_acc, test_duration = evaluate(features, support, y_test, test_mask, placeholders)
print("Test set results:", "cost=", "{:.5f}".format(test_cost),
      "accuracy=", "{:.5f}".format(test_acc), "time=", "{:.5f}".format(test_duration))

sess.close()

# III. Graph Convolutional Network Approach

**Q9**: As before, run the code below, try to understand it, look at the source code and annotate it if you want.

In [None]:
support = [preprocess_adj(adj)]
num_supports = 1
model_func = GCN
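
The main difference from the MLP is that each layer also mixes the features of neighboring nodes through the `support` matrix. The sketch below assumes that `preprocess_adj` follows the renormalization described in the GCN paper, that is, the symmetrically normalized adjacency matrix with self-loops. The dense NumPy helpers, the toy graph and the weights are hypothetical and do not reproduce the `gcn` package code.

```python
import numpy as np

def normalize_adjacency(adj):
    """Return D^{-1/2} (A + I) D^{-1/2} for a dense adjacency matrix A."""
    adj_self = adj + np.eye(adj.shape[0])
    degree = adj_self.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(degree)  # degree >= 1 thanks to the self-loops
    return adj_self * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_layer(a_hat, x, w):
    """One graph convolution: ReLU(A_hat @ X @ W)."""
    return np.maximum(a_hat @ x @ w, 0.0)

# Hypothetical 3-node path graph 0 - 1 - 2 with 2 features per node
toy_adj = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
toy_x = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
toy_w = np.ones((2, 2))  # placeholder weights, not learned

a_hat = normalize_adjacency(toy_adj)
hidden = gcn_layer(a_hat, toy_x, toy_w)
print(a_hat)
print(hidden.shape)
```

A plain MLP layer would compute `ReLU(X @ W)` and therefore ignore the graph; multiplying by `a_hat` first averages each node's features with those of its neighbors.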

In [None]:
# Define placeholders
placeholders = {
    'support': [tf.sparse_placeholder(tf.float32) for _ in range(num_supports)],
    'features': tf.sparse_placeholder(tf.float32, shape=tf.constant(features[2], dtype=tf.int64)),
    'labels': tf.placeholder(tf.float32, shape=(None, y_train.shape[1])),
    'labels_mask': tf.placeholder(tf.int32),
    'dropout': tf.placeholder_with_default(0., shape=()),
    'num_features_nonzero': tf.placeholder(tf.int32)  # helper variable for sparse dropout
}

# Create model
model = model_func(placeholders, LR, HIDDEN_1, WEIGHT_DECAY, input_dim=features[2][1], logging=True)

# Initialize session
sess = tf.Session()


# Define model evaluation function
def evaluate(features, support, labels, mask, placeholders):
    t_test = time.time()
    feed_dict_val = construct_feed_dict(features, support, labels, mask, placeholders)
    outs_val = sess.run([model.loss, model.accuracy], feed_dict=feed_dict_val)
    return outs_val[0], outs_val[1], (time.time() - t_test)


# Init variables
sess.run(tf.global_variables_initializer())

cost_val = []

**Q10**: The code below will run the model training. Try to understand it. Then, plot the results stored in the lists `plot_acc` and `plot_val_acc`.

In [None]:
plot_val_acc = []
plot_acc = []

for epoch in range(EPOCHS):

    t = time.time()
    # Construct feed dictionary
    feed_dict = construct_feed_dict(features, support, y_train, train_mask, placeholders)
    feed_dict.update({placeholders['dropout']: DROPOUT})

    # Training step
    outs = sess.run([model.opt_op, model.loss, model.accuracy], feed_dict=feed_dict)

    # Validation
    cost, acc, duration = evaluate(features, support, y_val, val_mask, placeholders)
    cost_val.append(cost)
    
    plot_val_acc.append(acc)
    plot_acc.append(outs[2])
    
    # Print results
    print("Epoch:", '%04d' % (epoch + 1), "train_loss=", "{:.5f}".format(outs[1]),
          "train_acc=", "{:.5f}".format(outs[2]), "val_loss=", "{:.5f}".format(cost),
          "val_acc=", "{:.5f}".format(acc), "time=", "{:.5f}".format(time.time() - t))

    if epoch > EARLY_STOPPING and cost_val[-1] > np.mean(cost_val[-(EARLY_STOPPING+1):-1]):
        print("Early stopping...")
        break


In [None]:
# TODO : plot the accuracy on the training set and on the validation set

**Q11**: What do you notice?

Although there is still some slight overfitting, the overall model performance is much better.

**Q12**: The code below evaluates the GCN and reports its performance on the test set (`y_test`) that was set aside.

In [None]:
test_cost, test_acc, test_duration = evaluate(features, support, y_test, test_mask, placeholders)
print("Test set results:", "cost=", "{:.5f}".format(test_cost),
      "accuracy=", "{:.5f}".format(test_acc), "time=", "{:.5f}".format(test_duration))

This example shows how important it can be to take the network structure into account during training: here, it improves the model performance by close to 25%.