# Machine Learning for Graphs - Tutorial B: The Graph Convolutional Network

In [1]:
# %pip install nbconvert[webpdf]

Fill in your names and group number here:

**NAME STUDENT A :** Tianzheng Hu (2760270)

**NAME STUDENT B :** Lijing Luo (2794795)

**GROUP NUMBER :** 7


Implementing a machine learning experiment with graph data is an important skill that you will learn as part of this course. This hands-on tutorial will help you develop this skill, as well as help you familiarize yourself with many of the steps and techniques that you will likely need to use for your final project.

Representation learning is the task of learning sensible representations for your samples given some downstream task. On graphs, representation learning is commonly used to learn vector representations of the nodes. These node representations are often called *embeddings vectors* or just *embeddings*. Graph Neural Networks (GNN) are ideal for learning node embeddings, since the identity of a node is a function of its neighbourhood (up to depth *d*) and since GNNs learn internal node representations by applying an aggregation operator on exactly this neighbourhood. Different models with various choices of aggregation operator have been introduced over the past couple of years, with the *convolutional* and *attention* operators being the more popular choices.

For this tutorial, you are asked to implement the original *Graph Convolutional Network* (GCN) and to replicate some of the classification experiments from the [paper](https://arxiv.org/abs/1609.02907) that introduced it [1]. To help you on your way, we have already prepared this Python Notebook.

You are asked to team up with another student and to work together on this tutorial. Please register your team by creating a new group and by adding both members.

    [1] Kipf, T. N., & Welling, M. Semi-supervised Classification With Graph Convolutional Networks (2017).
---

## NumPy and PyTorch

In this course we will make use of the [NumPy](https://numpy.org) package for working with vector data, and the [PyTorch](https://pytorch.org) machine learning package. Both of these are probably already installed in your environment as part of the first tutorial (Numpy as a dependency of PyTorch) but if this is not the case then running the following cell will install these packages for you.

**Run the cell below to install the NumPy and PyTorch packages in your Python environment**

In [2]:
# %pip install numpy torch

**Run the cells below to import the necessay packages and to set a manual seed**

In [3]:
import torch
import torch.nn as nn
import torch.nn.functional as F

seed = 42  # for reproducability
torch.manual_seed(seed)

<torch._C.Generator at 0x7f7db8c53c10>

## Data Preparation

In the previous tutorial we used the *RDFlib* package to import the dataset. That the dataset was encoded using an open standard made this possible. This is not always the case, however: it is very common to come accross graph datasets that use an arbitrary encoding. In the case of the *Cora* dataset loaded below, the graph has been stored in two parts: the first a set of integer-encoded edges $[i, j]$, with $i$ and $j$ the indices of the nodes, and the second as a set of *n*-hot encoded node representations. Being a citation graph, the edges convey who cites who, whereas the node vectors $e_i$ represent a sparse bag-of-words with vocabulary $\Omega$ for which holds that $e_i[j] = 1$ if word $\Omega[j]$ occurs in the document and $0$ otherwise.

To import the Cora dataset we first process the raw files using NumPy and cast the generated arrays to the correct datatypes. Next, we generate a node-to-integer map and reindex the edges to ensure that their node identifiers match those of the nodes.

**Run the following cells to import and process the data**

In [4]:
import numpy as np
path = './data/'

data = np.genfromtxt(path + "cora.content", dtype = str)
edges = np.genfromtxt(path + "cora.cites", dtype = int)

In [5]:
# these have the same order
features = data[:, 1:-1].astype(int)
labels = data[:, -1]
nodes = data[:, 0].astype(int)

n2i = {n:i for i,n in enumerate(nodes)}
edges_reindexed = np.array([[n2i[source], n2i[target]] for source, target in edges])

num_nodes = len(nodes)
num_edges = len(edges)
num_features = len(features[0])

In [6]:
nodes, nodes.shape, edges, len(edges), features, len(features), (edges_reindexed).shape, labels, len(labels)

(array([  31336, 1061127, 1106406, ..., 1128978,  117328,   24043]),
 (2708,),
 array([[     35,    1033],
        [     35,  103482],
        [     35,  103515],
        ...,
        [ 853118, 1140289],
        [ 853155,  853118],
        [ 954315, 1155073]]),
 5429,
 array([[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]]),
 2708,
 (5429, 2),
 array(['Neural_Networks', 'Rule_Learning', 'Reinforcement_Learning', ...,
        'Genetic_Algorithms', 'Case_Based', 'Neural_Networks'],
       dtype='<U22'),
 2708)

In [7]:
# inspect the data
print(f"Number of nodes: {num_nodes}")
print(f"Number of edges: {num_edges}")
print(f"Number of features: {num_features}\n")

for i in range(5):
    print(f"Node ID: {nodes[i]}")
    print(f"Node features: {features[i]}")
    print(f"Node label: {labels[i]}\n")
    
print(f"Edges: \n{edges_reindexed[:5]}")

Number of nodes: 2708
Number of edges: 5429
Number of features: 1433

Node ID: 31336
Node features: [0 0 0 ... 0 0 0]
Node label: Neural_Networks

Node ID: 1061127
Node features: [0 0 0 ... 0 0 0]
Node label: Rule_Learning

Node ID: 1106406
Node features: [0 0 0 ... 0 0 0]
Node label: Reinforcement_Learning

Node ID: 13195
Node features: [0 0 0 ... 0 0 0]
Node label: Reinforcement_Learning

Node ID: 37879
Node features: [0 0 0 ... 0 0 0]
Node label: Probabilistic_Methods

Edges: 
[[ 163  402]
 [ 163  659]
 [ 163 1696]
 [ 163 2295]
 [ 163 1274]]


## Task 1: Vectorizing the graph

Since graph neural networks aggregate the information from the neighbourhoods of nodes, they need to know which nodes are adjacent to which other nodes. Because the information from those neighbours must also be aggregated from *their* neighbourhoods, these models thus need a relatively large amount of information about the structure of a graph. This information comes in the form of an *adjacency matrix* $A$, such that $A[i,j] = 1$ if there exists a link between nodes $i$ and $j$, and $0$ otherwise.

Of course, the adjacency matrix only tells the model which nodes to aggregate. To also know *what* to aggregate, we need another matrix which uniquely identifies each node. This matrix is often called the *node feature matrix* $X$. If our nodes comes with one or more attributes, or *features*, then we can fill up this matrix with the corresponding values. This is commonly done with *multimodal learning*. More often, however, it is easier to just ignore the node features (if any), and to let $X$ equal the identity matrix $I$ such that $X[i,j] = 1$ iff $i = j$ and $0$ otherwise.

Finally, since the downstream task is *node classification*, we need a vector representation, the *target vector* $y$, for the class labels that are used to compute the loss and accuracy scores. Since we need to calculate the gradients during this step, we need a numerical encoding for the labels. 

### Task 1a: Creating a feature matrix

Write a procedure to generate a node feature matrix that maps each node to its respective feature vector. The result should be a *sparse* float tensor `X`, such that `X[i]` refers to the feature vector of node `i`. Since the Cora dataset comes with integer-encoded node features (the bag-of-words) there is no need to generate an indentity matrix. Remember that the whole set of features is stored in variable `features`.

In [8]:
X = torch.from_numpy(features).float()

In [9]:
X.shape

torch.Size([2708, 1433])

Run the following code to check your feature matrix

In [10]:
# Check your feature matrix
X.to_dense()[:10,:10]

tensor([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 1., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]])

### Task 1b: Create an adjacency matrix

Write a procedure to generate the adjacency matrix for the Cora graph. The result should be a *sparse* float tensor `A`, such that `A[i,j]` equals `1` if there exists an edge between nodes `i` and `j`, and `0` otherwise. Be aware that the GCN requires all nodes to have a reflexive edge (loops) which ensures that the nodes remember their previous state when updating.  

In [11]:
from scipy.sparse import identity

adjacency_matrix= np.zeros((num_nodes,num_nodes), dtype = int)
identity_matrix = identity(num_nodes)

for i in range(len(edges_reindexed)):
    row = edges_reindexed[i][0]
    col = edges_reindexed[i][1]
    adjacency_matrix[row][col] = 1
    

# res = (adjacency_matrix==adjacency_matrix.T).all()

adjacency_matrix = adjacency_matrix + identity_matrix
A = torch.from_numpy(adjacency_matrix)

Run the following code to check your adjacency matrix

In [12]:
# Check your adjacency matrix by using the sum as proxy
print(f"The number of connections, {int(A.sum())}, must equal the number of edges, {num_edges}," 
      f" plus the number of nodes, {num_nodes}")
A.to_dense()[:10,:10]

The number of connections, 8137, must equal the number of edges, 5429, plus the number of nodes, 2708


tensor([[1., 0., 0., 0., 0., 0., 0., 0., 1., 0.],
        [0., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 1., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 1., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 1., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 1., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 1., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 1., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 1.]], dtype=torch.float64)

### Task 1c: Create the target vector

Write a procedure to generate the target vector with integer-encoded class labels. The result should be a `long` vector `y_true`, such that `y_true[i]` holds the target label of node `i`. Note that, with PyTorch, different loss functions require differently formatted target vectors.

In [13]:
encoded_labels = {ids: i for i, ids in enumerate(set(labels))}
y_true = [encoded_labels[label] for label in labels]

num_labels = np.unique(labels)

In [14]:
num_labels.size

7

Run the following code to check your target vector

In [15]:
print(f'number of unique labels: {num_labels}\n')

print(f'y: {y_true[:10]}')

number of unique labels: ['Case_Based' 'Genetic_Algorithms' 'Neural_Networks'
 'Probabilistic_Methods' 'Reinforcement_Learning' 'Rule_Learning' 'Theory']

y: [5, 0, 3, 3, 4, 4, 1, 5, 5, 1]


## Task 2: Partition the dataset

To properly perform our experiments we first need to partition our data into a _train_ and _test_ split. These splits are used to train and test our model, respectively, and must be disjoint to avoid information leakage. Ideally, we would als create a _validation_ split to use for model selection and/or hyperparameter optimization, but we dispense with that for now.

Create a procedure to create a train and test split with a ratio of 4 to 1. The result should be two vectors, `train_idx` and `test_idx`, that contain indices that point to the actual data (a _mask_) that are randomly drawn from the set of all indices.

In [16]:
num_train = num_nodes // 4
num_test = num_nodes - num_train

# use mask
# Generate a mask vector with 1/4 numbers as 1 and others as 0
train_mask_vector = np.ones(num_nodes)
train_mask_vector[:num_train] = 0
np.random.shuffle(train_mask_vector)

# Generate another vector in the opposite way
test_mask_vector = 1 - train_mask_vector

train_mask_vector[:10],test_mask_vector[:10]

train_idx = train_mask_vector
test_idx = test_mask_vector

train_idx, test_idx

train_mask = [value == 1 for value in train_idx]
test_mask = [value == 1 for value in test_idx]

train_mask, test_mask

([False,
  False,
  False,
  True,
  False,
  True,
  True,
  True,
  True,
  True,
  True,
  True,
  True,
  True,
  True,
  True,
  True,
  True,
  False,
  True,
  True,
  False,
  True,
  False,
  True,
  True,
  True,
  True,
  True,
  True,
  True,
  False,
  True,
  True,
  False,
  True,
  True,
  True,
  True,
  True,
  True,
  True,
  True,
  True,
  True,
  True,
  True,
  True,
  False,
  True,
  True,
  True,
  True,
  True,
  True,
  True,
  False,
  True,
  True,
  False,
  True,
  True,
  True,
  True,
  True,
  False,
  False,
  True,
  True,
  True,
  True,
  True,
  True,
  True,
  True,
  True,
  True,
  True,
  True,
  True,
  True,
  True,
  True,
  False,
  False,
  True,
  False,
  True,
  True,
  True,
  True,
  False,
  False,
  True,
  True,
  False,
  True,
  False,
  True,
  True,
  False,
  True,
  True,
  True,
  True,
  True,
  False,
  True,
  True,
  True,
  False,
  False,
  True,
  True,
  True,
  False,
  True,
  False,
  True,
  False,
  False,
  T

Run the following code to check your partitions

In [17]:
print(f"number of training samples: {num_train}")
print(f"number of testing samples: {num_test}")

print(f"\ntrain indices:\n{train_idx[:5]}")
print(f"\ntest indices:\n{test_idx[:5]}")

number of training samples: 677
number of testing samples: 2031

train indices:
[0. 0. 0. 1. 0.]

test indices:
[1. 1. 1. 0. 1.]


## The Graph Convolutional

The *Graph Convolutional Network* (GCN) is arguably the first major breakthrough in GNN development. Developed in 2017, the GCN introduces the idea of the *spectral graph convolution*, which, analogues to its visual counterpart, aggregates the information surrounding an object. In the case of *Convolutional Neural Networks* (CNN), these objects are pixels, whereas with the GCN these are nodes. This comparison becomes evident when you consider images as regular (grid-shaped) graphs with pixel as nodes.

The GCN is defined as a network with one or more *Graph Convolution* layers. Each of these layers applies the convolution operator to its input, and is defined as 

$$ H^{l+1} = \sigma(\tilde{D}^{- \frac{1}{2}} \tilde{A} \tilde{D}^{- \frac{1}{2}} H^l W^l) $$

where $\tilde{A}$ is the adjacency matrix with reflexive edges, $\tilde{D}$ the degree matrix derived from $\tilde{A}$, $H^l$ the internal node representations of layer $l$, $W^l$ the weight matrix of layer $l$, and $\sigma$ a nonlinearity like $ReLU$. Note that the initial node representation matrix $H^0 = X$.

In the experiments that we are reproducing the GCN is used for the task of node classification. For this purpose, the GCN is given two graph convolution layers, but with the nonlinearity of the last layer replaced by a softmax function:

$$ y = softmax(\hat{A}~\sigma(\hat{A} X W^0)~W^1) $$

with $\hat{A} = \tilde{D}^{- \frac{1}{2}} \tilde{A} \tilde{D}^{- \frac{1}{2}}$

### Task 3a: Implement the Graph Convolution

Implement the graph convolution layer as a subclass of PyTorch `nn.Module`. Concretely, you must implement the `__init__` and `forward` functions. Ensure that the computation supports sparse tensors, and that the input and output dimensions can be set on initialisation.

In [18]:
from scipy.sparse import coo_matrix
from scipy.sparse import diags
from torch.nn import Parameter
def get_A_hat_torch(A):
    A_tilde = coo_matrix(A, dtype=float)
    degrees = A_tilde.sum(axis=1).flatten().A
    Diag_matrix = diags(degrees, list(range(len(degrees))), dtype=float)
    A_hat = (Diag_matrix.power(-0.5) @ A_tilde @ Diag_matrix.power(-0.5)).tocoo()


    # A as sparse PyTorch tensor
    indices = np.vstack((A_hat.row, A_hat.col))
    A_hat_torch = torch.sparse_coo_tensor(indices, A_hat.data, dtype=torch.float)
    return A_hat_torch


class GraphConvolutionLayer(nn.Module):
    """
    A single Graph Convolution Layer
    """

    def __init__(self, in_features, out_features):
        super().__init__()
        # your code here
        # Define learnable weight matrix
        self.W = Parameter(torch.Tensor(in_features, out_features))
        torch.nn.init.kaiming_uniform_(self.W)  # Xavier/Glorot initialization
        
    def forward(self, A, X) -> torch.Tensor:
        # your code here
        # Perform graph convolution: Z = ÂXW       
        X_hat = torch.spmm(A, X)
#         X_hat = torch.spmm(X, A)
        Z = torch.spmm(X_hat, self.W)

        return Z

In [19]:
# in_feature = X.shape[1]
# W = Parameter(torch.Tensor(in_feature, 48))
# torch.nn.init.kaiming_uniform_(W)  # Xavier/Glorot initi

# A_hat_torch = get_A_hat_torch(A)
# X_hat = torch.spmm(A_hat_torch, X)
# # X_hat = torch.spmm(X, A_hat)
# Z = torch.spmm(X_hat, W)

# A_hat_torch

Run the following cell to initialize and test your implementation


In [20]:
in_features = X.shape[1]
out_features = 48
A_hat_torch = get_A_hat_torch(A)

conv = GraphConvolutionLayer(in_features, out_features)
conv(A_hat_torch, X)

tensor([[-0.1822,  0.1992,  0.0685,  ..., -0.5589,  1.0460, -0.5471],
        [ 1.5105,  0.0549,  0.3791,  ...,  0.1812, -1.7140, -0.5104],
        [ 0.2074, -1.2355, -2.3261,  ..., -1.6745, -1.0768,  0.3438],
        ...,
        [ 1.3739, -0.9813,  0.9853,  ..., -0.4488, -0.1333,  0.5263],
        [ 1.1874, -1.6371, -1.0612,  ..., -0.5831, -0.0058,  0.2217],
        [ 0.8293, -0.3370,  0.5844,  ...,  0.3284, -0.0392, -1.2089]],
       grad_fn=<MmBackward0>)

### Task 3b: Implement the Graph Convolutional Model

Implement the GCN as specified in the paper [1]. Concretely, implement a two-layer GCN with a ReLU activation function and dropout after the first layer, and with a softmax layer after the second.

In [21]:
from scipy.sparse import coo_matrix
from scipy.sparse import diags
from torch.nn import Parameter
def get_A_hat_torch(A):
    A_tilde = coo_matrix(A, dtype=float)
    degrees = A_tilde.sum(axis=1).flatten().A
    Diag_matrix = diags(degrees, list(range(len(degrees))), dtype=float)
    A_hat = (Diag_matrix.power(-0.5) @ A_tilde @ Diag_matrix.power(-0.5)).tocoo()


    # A as sparse PyTorch tensor
    indices = np.vstack((A_hat.row, A_hat.col))
    A_hat_torch = torch.sparse_coo_tensor(indices, A_hat.data, dtype=torch.float)
    return A_hat_torch


class GraphConvolutionLayer(nn.Module):
    """
    A single Graph Convolution Layer
    """

    def __init__(self, in_features, out_features):
        super().__init__()
        # your code here
        # Define learnable weight matrix
        self.W = Parameter(torch.Tensor(in_features, out_features))
        torch.nn.init.kaiming_uniform_(self.W)  # Xavier/Glorot initialization
        
    def forward(self, A, X) -> torch.Tensor:
        # your code here
        # Perform graph convolution: Z = ÂXW       
#         X_hat = torch.spmm(A, X)
        X_hat = torch.spmm(X, A)
        Z = torch.spmm(X_hat, self.W)

        return Z

In [22]:
class GCN(nn.Module):
    def __init__(self, in_features, hidden_size, out_features):
        super().__init__()
        # your code here
        self.layer1 = GraphConvolutionLayer(in_features, hidden_size)
        self.layer2 = GraphConvolutionLayer(hidden_size, out_features)

    def forward(self, X, A) -> torch.Tensor:
        # your code here
        X_mid = F.relu(self.layer1(X, A))
        result = F.softmax(self.layer2(X_mid, A))
        return result

Run the following cell to initialize and test your implementation

In [23]:
in_features = X.shape[1]
hidden_size = 48
out_features = num_labels.size

model = GCN(in_features, hidden_size, out_features)

y_pred = model(X, get_A_hat_torch(A))
y_pred

  result = F.softmax(self.layer2(X_mid, A))


tensor([[4.0806e-02, 1.5808e-01, 3.2214e-03,  ..., 7.6862e-01, 4.8806e-03,
         2.0993e-02],
        [3.4422e-02, 9.8172e-03, 2.2864e-02,  ..., 7.5938e-01, 1.5408e-02,
         1.7496e-02],
        [1.2730e-01, 5.9436e-03, 4.9778e-01,  ..., 2.8078e-01, 1.7351e-02,
         5.9357e-02],
        ...,
        [5.9801e-04, 3.2077e-01, 2.5444e-04,  ..., 3.5504e-03, 7.1108e-04,
         6.6268e-01],
        [3.3892e-02, 1.1778e-02, 2.8148e-02,  ..., 8.4863e-01, 9.6374e-03,
         6.0665e-02],
        [3.9481e-02, 6.6859e-03, 7.6810e-02,  ..., 4.1324e-01, 8.6512e-03,
         3.5710e-01]], grad_fn=<SoftmaxBackward0>)

## Training and testing

In normal circumstances the GCN updates its internal representation for all nodes in the graph after each pass. In other words, the GCN operates on the entire graph at once, rather than on just the training, test, or validation set. Since these sets are disjoint, it necessarily means that only part of the class labels are available each time. This is called *semi-supervised learning*. Because the model sees the entire graph each pass, it still outputs predictions for all the nodes. However, by just calculating the loss and accuracy on a specific split, we ensure that only the error on the nodes in that split is backpropagated.

### Task 4: Implementing evaluation metrics

Write a procedure to calculate the loss *and* a procedure to calculate the accuracy. Assume that we have a tensor with true labels, `y_true`, and a tensor with predicted labels, `y_pred`.

In [24]:
def compute_accuracy(y_pred, y_true) -> float:
    # your code here
    y_pred_labels = torch.argmax(y_pred, dim=1)
    
    # Calculate the number of values that are the same in corresponding positions
    matching_values = sum(y_hat == y for y_hat, y in zip(y_pred_labels, y_true))
    accuracy = matching_values/len(y_pred_labels)
    
    return accuracy
    

def compute_loss(y_pred, y_true) -> torch.Tensor:
    # your code here
    m = nn.LogSoftmax(dim=1)
    loss = loss_function(m(y_pred), (y_true))
    return loss

In [25]:
y_pred.shape

torch.Size([2708, 7])

Run the following cell to test your code:

In [26]:
loss_function = nn.NLLLoss()

y_pred_labels = torch.argmax(y_pred, dim=1)
print(f'Predicted labels: {y_pred_labels[:10]}')
print(f'True labels: {y_true[:10]}')

acc = compute_accuracy(y_pred, torch.tensor(y_true))
print(f'Accuracy: {acc:.3f}')

loss = compute_loss(y_pred, torch.tensor(y_true))
print(f'Loss: {loss:.3f}')


Predicted labels: tensor([4, 4, 2, 6, 1, 3, 1, 4, 4, 4])
True labels: [5, 0, 3, 3, 4, 4, 1, 5, 5, 1]
Accuracy: 0.112
Loss: 2.004


### Task 5a: Implement the training loop

Write a procedure to train the model. Specifically, create a loop that passes the entire graph through the model every epoch, while computing the loss and accuracy on just the training set. Use the Adam optimizer and the negative log likelihood loss.

In [27]:
# set hyperparameters
learning_rate = 0.01
num_epoch = 200
in_features = X.shape[1]
hidden_size = 48
out_features = num_labels.size

model = GCN(in_features, hidden_size, out_features)
loss_function = nn.NLLLoss()
A_hat_torch = get_A_hat_torch(A)

# set optimizer
optimizer = torch.optim.Adam(model.parameters(),
                             lr = learning_rate)

for epoch in range(1, num_epoch+1):
    print(f'Epoch {epoch:3d} - ', end='')

    # allow model parameters to be learned   
    model.train()  

    # your code here
    pred = model(X, A_hat_torch)
#     loss = compute_loss(pred, y_true)
    loss = compute_loss(pred[torch.nonzero(torch.tensor(train_idx)).squeeze()], torch.tensor(y_true)[torch.nonzero(torch.tensor(train_idx)).squeeze()])
    
    acc = compute_accuracy(pred[torch.nonzero(torch.tensor(train_idx)).squeeze()], torch.tensor(y_true)[torch.nonzero(torch.tensor(train_idx)).squeeze()])

    # Zero gradients, perform a backward pass, and update the weights.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    loss = float(loss)  # release memory of computation graph

    print(f'loss: {loss:0.4f}\tacc: {acc:0.4f}')

Epoch   1 - loss: 1.9501	acc: 0.1679
Epoch   2 - 

  result = F.softmax(self.layer2(X_mid, A))


loss: 1.8345	acc: 0.3274
Epoch   3 - loss: 1.7502	acc: 0.4215
Epoch   4 - loss: 1.6715	acc: 0.5175
Epoch   5 - loss: 1.6014	acc: 0.5879
Epoch   6 - loss: 1.5460	acc: 0.6529
Epoch   7 - loss: 1.5047	acc: 0.6967
Epoch   8 - loss: 1.4687	acc: 0.7336
Epoch   9 - loss: 1.4361	acc: 0.7597
Epoch  10 - loss: 1.4081	acc: 0.7799
Epoch  11 - loss: 1.3842	acc: 0.8001
Epoch  12 - loss: 1.3623	acc: 0.8203
Epoch  13 - loss: 1.3432	acc: 0.8395
Epoch  14 - loss: 1.3282	acc: 0.8513
Epoch  15 - loss: 1.3159	acc: 0.8661
Epoch  16 - loss: 1.3051	acc: 0.8720
Epoch  17 - loss: 1.2954	acc: 0.8823
Epoch  18 - loss: 1.2868	acc: 0.8902
Epoch  19 - loss: 1.2797	acc: 0.8941
Epoch  20 - loss: 1.2732	acc: 0.9000
Epoch  21 - loss: 1.2672	acc: 0.9065
Epoch  22 - loss: 1.2620	acc: 0.9114
Epoch  23 - loss: 1.2578	acc: 0.9138
Epoch  24 - loss: 1.2542	acc: 0.9183
Epoch  25 - loss: 1.2508	acc: 0.9207
Epoch  26 - loss: 1.2476	acc: 0.9242
Epoch  27 - loss: 1.2448	acc: 0.9261
Epoch  28 - loss: 1.2422	acc: 0.9286
Epoch  29 - l

### Task 5b: Implement the test procedure

Write a procedure to test the now-trained model. Ensure that the weights of your model are frozen during testing, and that the loss and accuracy scores are calculated on just the test set.

In [28]:
# freeze model parameters for evaluation
model.eval()

# your code here
pred = model(X, A_hat_torch)
loss = compute_loss(pred[torch.nonzero(torch.tensor(test_idx)).squeeze()], torch.tensor(y_true)[torch.nonzero(torch.tensor(test_idx)).squeeze()])
acc = compute_accuracy(pred[torch.nonzero(torch.tensor(test_idx)).squeeze()], torch.tensor(y_true)[torch.nonzero(torch.tensor(test_idx)).squeeze()])

loss = float(loss)  # release memory of computation graph

print(f'test loss: {loss:0.4f}\ntest acc: {acc:0.4f}')

test loss: 1.3308
test acc: 0.8346


  result = F.softmax(self.layer2(X_mid, A))
