# Assignment

In this assignment, we will use the Cora citation network. Each node represents a paper, and each edge from node $i$ to node $j$ represents a citation of paper $j$ by paper $i$. Each paper is assigned a field code, which is stored in the `field` column of the node table.
We will ignore edge direction and apply a graph embedding to the undirected network.

In [None]:
import pandas as pd
import numpy as np
from scipy import sparse

node_table = pd.read_csv(
    "https://raw.githubusercontent.com/skojaku/adv-net-sci-course/main/data/cora/node_table.csv"
)
node_feature_table = pd.read_csv(
    "https://raw.githubusercontent.com/skojaku/adv-net-sci-course/main/data/cora/node_features.csv"
)
edge_table = pd.read_csv(
    "https://raw.githubusercontent.com/skojaku/adv-net-sci-course/main/data/cora/edge_table.csv",
    dtype={"src": np.int32, "trg": np.int32},
)
src, trg = tuple(edge_table[["src", "trg"]].values.T)

rows, cols = src, trg
nrows, ncols = node_table.shape[0], node_table.shape[0]
A = sparse.csr_matrix(
    (np.ones_like(rows), (rows, cols)),
    shape=(nrows, ncols),
).astype(np.float64)

# Symmetrize and binarize
A = A + A.T
A.data = A.data * 0 + 1
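
To see what the symmetrize-and-binarize step does, here is a minimal sketch on a hypothetical three-node directed graph (the toy edges and the names `A_toy` and `A_sym` are assumptions for illustration, not part of the Cora data). A reciprocal pair of citations yields a 2 after `A + A.T`, and binarizing resets it to 1.

```python
import numpy as np
from scipy import sparse

# Hypothetical toy edges: 0 -> 1, 1 -> 0 (a reciprocal pair), and 1 -> 2
toy_src = np.array([0, 1, 1])
toy_trg = np.array([1, 0, 2])
A_toy = sparse.csr_matrix(
    (np.ones(len(toy_src)), (toy_src, toy_trg)), shape=(3, 3)
)

# Symmetrize: the reciprocal pair is counted twice at this point
A_sym = A_toy + A_toy.T
print(A_sym.toarray())

# Binarize: every stored entry becomes 1
A_sym.data = np.ones_like(A_sym.data)
print(A_sym.toarray())
```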

In [None]:
# Node features
node_features = node_feature_table.drop(columns=["node_id"]).values

# Node labels (field)
node_labels = node_table["field"].values  # Raw labels (str)
node_label_ids = np.unique(node_labels, return_inverse=True)[1]  # Integer labels

Additionally, we create a PyTorch version of the sparse matrix: 

In [None]:
import torch


# Use this function to convert scipy sparse matrix to torch sparse matrix
def to_torch_sparse(A):
    """Convert scipy sparse matrix to torch sparse matrix"""
    Atorch = torch.sparse_csr_tensor(A.indptr, A.indices, A.data, dtype=torch.float32)
    return Atorch


Atorch = to_torch_sparse(A)

---
**Question 1: Implement the graph convolutional layer (Eq. 2 in [the paper](https://openreview.net/pdf?id=SJU4ayYgl) with [ReLU activation](https://pytorch.org/docs/stable/generated/torch.nn.ReLU.html) $\sigma$) using PyTorch, NumPy, and SciPy.**
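
For reference, the layer-wise propagation rule of Eq. 2 in the linked paper can be written as follows, where $H^{(l)}$ is the matrix of node features at layer $l$, $W^{(l)}$ is the trainable weight matrix of that layer, and $I_N$ is the identity matrix:

$$
H^{(l+1)} = \sigma\left(\tilde{D}^{-1/2}\,\tilde{A}\,\tilde{D}^{-1/2}\,H^{(l)}\,W^{(l)}\right),
\qquad \tilde{A} = A + I_N,
\qquad \tilde{D}_{ii} = \sum_j \tilde{A}_{ij}.
$$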

In [None]:
import torch
from torch.nn import Linear, LeakyReLU, Softmax
import torch.nn.functional as F
import numpy as np
from scipy.sparse import coo_matrix


class GraphConv(torch.nn.Module):
  def __init__(self, in_channels, out_channels, A):
    """Graph Convolution Layer

    Parameters
    ----------
    in_channels : int
     Input dimension
    out_channels : int
     Output dimension
    A : scipy.csr_matrix (n_nodes, n_nodes)
     Adjacency matrix
    """
    super(GraphConv, self).__init__()
    # Your code ----
    self.conv_mat = ...
    self.linear = ...
    self.act = ...
    # --------------

  def forward(self, x):
    """Forward pass

    Parameters
    ----------
    x : torch.Tensor (n_nodes, in_channels)
     Input node features

    Returns
    -------
    torch.Tensor (n_nodes, out_channels)
     Output node features
    """

    # Your code ----
    z = ...
    # --------------

    return z


# Test
def test_GCN():
  GraphConv(100, 50, A)


test_GCN()

---
**Question 2: Implement the Graph Convolutional Network (GCN) using the `GraphConv` layer implemented above. Use at least one `GraphConv` layer, and make the last layer a softmax layer.**


**Implementation guideline:**

You can implement any GCN architecture that involves at least one `GraphConv` layer. Here is an example architecture.

1. The GCN starts with two graph convolutional layers applied to the input node feature vectors. The first layer maps feature vectors from `in_channel` to `hidden_channel` dimensions, and the second maps them from `hidden_channel` to `out_channel` dimensions. (Apply the `GraphConv` layer implemented above twice.)
2. A linear layer then maps the `out_channel`-dimensional vector to another `out_channel`-dimensional vector. (Apply `torch.nn.Linear`.)
3. Finally, a softmax layer turns the `out_channel`-dimensional vector into a probability distribution over the output classes, so that the outputs for each node are non-negative and sum to 1. (Apply `torch.nn.Softmax`.)
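
The three steps above can be sketched as follows. This is only one possible layout, not the reference solution: `ToyGraphConv` and `ToyGCN` are hypothetical names, the convolution uses a dense normalized adjacency matrix for brevity, and the four-node path graph and random features are made up for illustration.

```python
import numpy as np
import torch
from torch.nn import Linear, ReLU, Softmax


class ToyGraphConv(torch.nn.Module):
    """Stand-in graph convolution layer (dense, for illustration only)."""

    def __init__(self, in_channels, out_channels, A):
        super().__init__()
        # Normalized adjacency with self-loops: D^{-1/2} (A + I) D^{-1/2}
        A_hat = A + np.eye(A.shape[0])
        d_inv_sqrt = np.power(A_hat.sum(axis=1), -0.5)
        conv = d_inv_sqrt[:, None] * A_hat * d_inv_sqrt[None, :]
        self.register_buffer("conv_mat", torch.tensor(conv, dtype=torch.float32))
        self.linear = Linear(in_channels, out_channels)
        self.act = ReLU()

    def forward(self, x):
        return self.act(self.conv_mat @ self.linear(x))


class ToyGCN(torch.nn.Module):
    """Two graph convolutions -> linear layer -> softmax."""

    def __init__(self, in_channel, hidden_channel, out_channel, A):
        super().__init__()
        self.conv1 = ToyGraphConv(in_channel, hidden_channel, A)
        self.conv2 = ToyGraphConv(hidden_channel, out_channel, A)
        self.linear = Linear(out_channel, out_channel)
        self.softmax = Softmax(dim=1)  # normalize over classes, one row per node

    def forward(self, x):
        x = self.conv1(x)
        x = self.conv2(x)
        x = self.linear(x)
        return self.softmax(x)


# Hypothetical 4-node path graph with random 5-dimensional features
A_toy = np.array(
    [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], dtype=float
)
toy_model = ToyGCN(in_channel=5, hidden_channel=8, out_channel=3, A=A_toy)
probs = toy_model(torch.randn(4, 5))
print(probs.shape)       # torch.Size([4, 3])
print(probs.sum(dim=1))  # each row sums to 1 up to floating-point error
```

Passing `dim=1` to `Softmax` normalizes across classes for each node, which is what makes each output row a probability distribution.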

In [None]:
import torch
from torch.nn import Linear, LeakyReLU, Softmax
import torch.nn.functional as F
import numpy as np
from scipy.sparse import coo_matrix


class GraphConv(torch.nn.Module):
  def __init__(self, in_channels, out_channels, A):
    """Graph Convolution Layer

    Parameters
    ----------
    in_channels : int
     Input dimension
    out_channels : int
     Output dimension
    A : scipy.csr_matrix (n_nodes, n_nodes)
     Adjacency matrix
    """
    super(GraphConv, self).__init__()
    self.linear = Linear(in_channels, out_channels)
    self.act = LeakyReLU()
    Ahat = sparse.eye(A.shape[0]) + A
    indeg = np.array(Ahat.sum(axis=0)).reshape(-1)
    indeg_sqrt_inv = np.power(indeg, -0.5)
    Lhat = sparse.diags(indeg_sqrt_inv) @ Ahat @ sparse.diags(indeg_sqrt_inv)
    self.conv_mat = to_torch_sparse(Lhat)

  def forward(self, x):
    """Forward pass

    Parameters
    ----------
    x : torch.Tensor (n_nodes, in_channels)
     Input node features

    Returns
    -------
    torch.Tensor (n_nodes, out_channels)
     Output node features
    """

    # Your code ----
    z = self.linear(x)
    z = self.conv_mat @ z
    z = self.act(z)
    # --------------

    return z


# Test
def test_GCN():
  GraphConv(100, 50, A)


test_GCN()

---
**Question 3**


**Preparation:**
Consider the task of classifying papers into fields based on the citation network structure and *node features*. You are given the field labels for 80% of the papers, and the task is to classify the remaining 20% of the papers.

First, we will reserve 80% of the data for training and the remaining 20% for evaluating the performance.

In [None]:
# Split the node table into the train and test set.
df = node_table.sample(frac=1, random_state=0)
train_node_table = df.iloc[: int(len(df) * 0.8)]
test_node_table = df.iloc[int(len(df) * 0.8) :]
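
A quick sanity check for a shuffled 80/20 split, sketched on a hypothetical ten-row table (`toy_table` and the other `toy_` names are assumptions for illustration): when both slices use the same cut index, the train and test sets are disjoint and together cover every node.

```python
import pandas as pd

# Hypothetical toy node table with 10 nodes
toy_table = pd.DataFrame({"node_id": range(10)})
shuffled = toy_table.sample(frac=1, random_state=0)

cut = int(len(shuffled) * 0.8)  # single cut point shared by both slices
toy_train = shuffled.iloc[:cut]
toy_test = shuffled.iloc[cut:]

# The two sets should not share any node
overlap = set(toy_train["node_id"]) & set(toy_test["node_id"])
print(len(toy_train), len(toy_test), len(overlap))  # 8 2 0
```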

We will evaluate the classification performance using accuracy, i.e., the fraction of papers whose predicted label matches the true label:

In [None]:
def eval_prediction_accuracy(y, ypred):
    """Calculate prediction accuracy.

    Parameters
    ----------
    y : numpy.ndarray
     True labels.
    ypred : numpy.ndarray
     Predicted labels.

    Returns
    -------
    acc : float
     Prediction accuracy.
    """
    return float(np.sum(y == ypred)) / float(len(y))
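
As a small worked example of the same computation, using made-up label arrays (the `_toy` names are hypothetical): three of the four predictions match, so the accuracy is 0.75.

```python
import numpy as np

y_true_toy = np.array([0, 1, 2, 1])  # hypothetical true labels
y_pred_toy = np.array([0, 1, 1, 1])  # hypothetical predicted labels

# Fraction of matching entries: 3 of the 4 predictions are correct
acc_toy = float(np.sum(y_true_toy == y_pred_toy)) / float(len(y_true_toy))
print(acc_toy)  # 0.75
```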

We will use the GCN implemented above for classification. 

In [None]:
n_labels = len(np.unique(node_labels))

in_channel = node_features.shape[1]
hidden_channel = 100
out_channel = n_labels
model = GCN(
    in_channel=in_channel, hidden_channel=hidden_channel, out_channel=out_channel, A=A
)

Here is how we train the model:

In [None]:
from torch import optim
from tqdm.auto import tqdm


# Define training loop
def train(model, optimizer, criterion, x_train, y_train, A, train_mask):
    """Train the model

    Parameters
    ----------
    model : torch.nn.Module
     Model
    optimizer : torch.optim.Optimizer
      Optimizer
    criterion : torch.nn.modules.loss._Loss
      Loss function
    x_train : torch.Tensor (n_nodes, in_channels)
      Input node features
    y_train : torch.Tensor (n_nodes)
      True labels
    A : scipy.sparse.csr_matrix (n_nodes, n_nodes)
      Adjacency matrix
    train_mask : numpy.ndarray
      Mask for training nodes

    Returns
    -------
    loss.item() : float
      Loss value
    """

    # Reset gradient
    optimizer.zero_grad()

    # Forward pass
    output = model(x_train)

    # Only compute loss for nodes in the training set
    loss = criterion(output[train_mask, :], y_train[train_mask])

    # Backward pass
    loss.backward()

    # Update parameters
    optimizer.step()

    # Return loss
    return loss.item()


# Define loss function and optimizer
criterion = torch.nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.01)

# Find the indices of the train and test nodes
train_mask = np.array(train_node_table["node_id"])
test_mask = np.array(test_node_table["node_id"])

# Convert numpy arrays to torch tensors
X = torch.FloatTensor(node_features)
Y = torch.LongTensor(node_label_ids)

# Number of epochs to train
n_epochs = 500
pbar = tqdm(range(n_epochs))

# Train the model
model.train()
for epoch in pbar:
    loss = train(model, optimizer, criterion, X, Y, Atorch, train_mask)
    pbar.set_description(f"Epoch {epoch+1}, Loss: {loss:.4f}")

**Task: Generate the prediction using the trained model and evaluate the accuracy.**

In [None]:
# Evaluate the model
model.eval()

# Your code ----
# Hint
# output = ...
# ypred = ... # output gives a probability distribution over classes. Pick the one with the highest probability.
# --------------

acc = eval_prediction_accuracy(Y[test_mask].numpy(), ypred[test_mask])
print(f"Test accuracy: {acc:.4f}")
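
The hint about picking the class with the highest probability can be illustrated on a made-up output tensor (`toy_output` and `toy_ypred` are hypothetical names, and the probabilities are invented). This sketch assumes the model returns one row of class probabilities per node.

```python
import torch

# Hypothetical softmax output for 3 nodes and 4 classes
toy_output = torch.tensor(
    [
        [0.1, 0.7, 0.1, 0.1],
        [0.6, 0.2, 0.1, 0.1],
        [0.2, 0.2, 0.1, 0.5],
    ]
)

# Index of the largest probability in each row, converted to a NumPy array
toy_ypred = torch.argmax(toy_output, dim=1).numpy()
print(toy_ypred)  # [1 0 3]
```

When the output comes from a trained model, calling the model inside `torch.no_grad()` avoids tracking gradients during evaluation.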

**Question 4: Train the GCN with random feature vectors generated from a Gaussian distribution and perform the classification.**

In [None]:
node_features_random = np.random.randn(node_table.shape[0], 100)

We then reset the model so that its input dimension matches the random features.

In [None]:
in_channel = node_features_random.shape[1]
hidden_channel = 100
out_channel = n_labels
model = GCN(
    in_channel=in_channel, hidden_channel=hidden_channel, out_channel=out_channel, A=A
)

In [None]:
# Your code ----

# --------------
acc = eval_prediction_accuracy(Y[test_mask].numpy(), ypred[test_mask])
print(f"Test accuracy: {acc:.4f}")

You should see that the performance decreases. The gap relative to the GCN trained on the original node features reflects the information that those features contribute beyond the network structure.