In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch_geometric.datasets import Planetoid
from torch_geometric.nn import GCNConv

# Load the Cora dataset
dataset = Planetoid(root='data/Cora', name='Cora')

# Prepare data
data = dataset[0]

# Define a 2-layer GCN
class GCN(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super(GCN, self).__init__()
        self.conv1 = GCNConv(input_dim, hidden_dim)
        self.conv2 = GCNConv(hidden_dim, output_dim)

    def forward(self, data):
        x, edge_index = data.x, data.edge_index
        x = self.conv1(x, edge_index)
        x = torch.relu(x)
        x = self.conv2(x, edge_index)
        return torch.log_softmax(x, dim=1)

# Initialize model, optimizer, and loss function
model = GCN(input_dim=dataset.num_node_features, hidden_dim=16, output_dim=dataset.num_classes)
optimizer = optim.Adam(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()

def test(model):
    """Return test accuracy, test predictions, and test labels."""
    model.eval()
    with torch.no_grad():
        # Use the `data` object prepared above; accessing `dataset.data`
        # directly triggers a warning in recent PyG versions.
        out = model(data)
        pred = out.argmax(dim=1)
        is_correct = pred[data.test_mask] == data.y[data.test_mask]
        acc = int(is_correct.sum()) / int(data.test_mask.sum())
        return acc, pred[data.test_mask], data.y[data.test_mask]


acc_lst = []

# Training loop
for epoch in range(100):
    model.train()
    optimizer.zero_grad()
    
    # Forward pass
    out = model(data)
    loss = criterion(out[data.train_mask], data.y[data.train_mask])
    loss.backward()
    optimizer.step()

    if epoch % 10 == 0:
        print(f'Epoch {epoch}, Loss: {loss.item()}')

    acc_lst.append(test(model)[0])

print("Training complete!")


Epoch 0, Loss: 1.9440255165100098
Epoch 10, Loss: 0.5989371538162231
Epoch 20, Loss: 0.09909845143556595
Epoch 30, Loss: 0.021152963861823082
Epoch 40, Loss: 0.007933911867439747
Epoch 50, Loss: 0.004476470407098532
Epoch 60, Loss: 0.0031994658056646585
Epoch 70, Loss: 0.00258433073759079
Epoch 80, Loss: 0.0022107036784291267
Epoch 90, Loss: 0.0019445634679868817
Training complete!


In [2]:
acc_lst

[0.519,
 0.622,
 0.605,
 0.583,
 0.613,
 0.661,
 0.705,
 0.733,
 0.755,
 0.765,
 0.771,
 0.779,
 0.778,
 0.777,
 0.775,
 0.774,
 0.773,
 0.773,
 0.772,
 0.772,
 0.772,
 0.772,
 0.77,
 0.77,
 0.768,
 0.768,
 0.766,
 0.766,
 0.767,
 0.769,
 0.769,
 0.769,
 0.771,
 0.769,
 0.766,
 0.766,
 0.766,
 0.766,
 0.766,
 0.765,
 0.765,
 0.765,
 0.765,
 0.765,
 0.765,
 0.765,
 0.764,
 0.764,
 0.765,
 0.765,
 0.765,
 0.765,
 0.766,
 0.766,
 0.767,
 0.767,
 0.766,
 0.766,
 0.765,
 0.765,
 0.764,
 0.764,
 0.764,
 0.764,
 0.764,
 0.764,
 0.764,
 0.764,
 0.766,
 0.767,
 0.767,
 0.767,
 0.768,
 0.768,
 0.768,
 0.768,
 0.768,
 0.769,
 0.769,
 0.769,
 0.769,
 0.769,
 0.769,
 0.769,
 0.769,
 0.769,
 0.77,
 0.77,
 0.769,
 0.768,
 0.768,
 0.768,
 0.768,
 0.768,
 0.768,
 0.768,
 0.768,
 0.769,
 0.768,
 0.768]

## Explanation:
GCN aggregates features from a node’s neighbors using graph convolutions. This allows the network to learn representations based on both node features and graph structure.
The Cora dataset is used to classify nodes into one of 7 research topics.
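
The neighbor aggregation described above can be sketched without any library. This is a minimal illustration, assuming the symmetric normalization with self-loops, D^-1/2 (A + I) D^-1/2, and leaving out the learned weight matrix and the activation. The toy graph, the feature values, and the name `gcn_aggregate` are hypothetical and not part of the notebook.

```python
import math

def gcn_aggregate(features, edges):
    """One round of GCN-style aggregation: X' = D^-1/2 (A + I) D^-1/2 X.

    features: list of per-node feature vectors (lists of floats)
    edges: list of undirected (u, v) pairs
    The learned weight matrix and the activation are left out on purpose.
    """
    n = len(features)
    # Build neighbor sets and add a self-loop to every node.
    neighbors = [{i} for i in range(n)]
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    degree = [len(nb) for nb in neighbors]  # degree including the self-loop

    dim = len(features[0])
    out = []
    for i in range(n):
        row = [0.0] * dim
        for j in neighbors[i]:
            # Symmetric normalization weight for the pair (i, j).
            w = 1.0 / math.sqrt(degree[i] * degree[j])
            for k in range(dim):
                row[k] += w * features[j][k]
        out.append(row)
    return out

# Hypothetical toy graph: a path 0 - 1 - 2 with 2-dimensional features.
toy_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_edges = [(0, 1), (1, 2)]
aggregated = gcn_aggregate(toy_features, toy_edges)
for node, row in enumerate(aggregated):
    print(node, [round(v, 4) for v in row])
```

After one round, each node's vector is a weighted mix of its own features and those of its direct neighbors, which is how node features and graph structure end up in the same representation.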

## Questions (1 point each):

1. What would happen if we added more GCN layers (e.g., 3 layers instead of 2)? How would this affect over-smoothing?
2. What would happen if we used a larger hidden dimension (e.g., 64 instead of 16)? How would this impact the model's capacity?
3. What would happen if we replaced ReLU activation with a sigmoid function? Would the performance change?

4. What would happen if we trained on only 10% of the nodes and tested on the remaining 90%? How would the performance be affected?
5. What would happen if we used a different optimizer (e.g., RMSprop) instead of Adam? Would it affect the convergence speed?

Extra credit: 
1. What would happen if we used edge weights (non-binary) in the adjacency matrix? How would it affect message passing?
2. What would happen if we removed the log-softmax function in the output layer? Would the loss function still work correctly?

## No points, just for you to think about:
1. What would happen if we applied dropout to the node features during training? How would it affect the model’s generalization?
2. What would happen if we used mean-pooling instead of summing the messages in the GCN layers?
3. What would happen if we pre-trained the node features using a different algorithm, like Node2Vec, before feeding them into the GCN?


## Answers ## 
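1) Adding more layers means each node receives messages from farther away: with 3 layers, a node aggregates information from its 3-hop neighborhood instead of its 2-hop neighborhood. The features capture more of the surrounding graph, but as layers are stacked, a node's own features fade and the representations of different nodes become increasingly similar (over-smoothing), so discriminative information is lost.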
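
The over-smoothing effect can be illustrated with a small sketch. It assumes plain mean aggregation with self-loops (simpler than the normalization `GCNConv` uses) and no learned weights, so it only shows the smoothing tendency, not the behavior of the trained model. The path graph and the starting values are hypothetical.

```python
def mean_aggregate(values, neighbors):
    """Replace each node's value with the mean over itself and its neighbors."""
    return [sum(values[j] for j in nbrs) / len(nbrs) for nbrs in neighbors]

# Hypothetical path graph 0 - 1 - 2 - 3; each list includes the node itself (self-loop).
neighbors = [[0, 1], [0, 1, 2], [1, 2, 3], [2, 3]]
values = [1.0, 0.0, 0.0, 0.0]  # one scalar feature per node

# Track how far apart the node values are after each round of aggregation.
spreads = [max(values) - min(values)]
for _ in range(10):  # each round plays the role of one extra layer
    values = mean_aggregate(values, neighbors)
    spreads.append(max(values) - min(values))

for depth in (0, 2, 3, 10):
    print(f"after {depth} rounds: spread = {spreads[depth]:.4f}")
```

The spread between node values shrinks with every round, which is the sense in which deeper stacks make nodes harder to tell apart.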
2) With a larger hidden dimension, the first layer projects the input features into a wider space: each node is represented by a 64-dimensional vector rather than a 16-dimensional one. This gives the model more capacity and can improve performance. But if the hidden dimension is too high, i.e., very close to the input dimension, the model might overfit and memorize the training nodes rather than generalize to the test data.
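
To make the capacity argument concrete, the parameter count can be worked out by hand. This sketch assumes each `GCNConv` layer holds one weight matrix plus one bias vector (its default configuration) and uses Cora's 1433 input features and 7 classes; `gcn_param_count` is a hypothetical helper, not part of the notebook.

```python
def gcn_param_count(input_dim, hidden_dim, output_dim):
    """Parameters of a 2-layer GCN: one weight matrix and one bias vector per layer."""
    layer1 = input_dim * hidden_dim + hidden_dim
    layer2 = hidden_dim * output_dim + output_dim
    return layer1 + layer2

# Cora: 1433 input features per node, 7 classes.
for hidden in (16, 64):
    print(f"hidden_dim={hidden}: {gcn_param_count(1433, hidden, 7)} parameters")
```

Going from 16 to 64 hidden units roughly quadruples the number of parameters, because the first layer's weight matrix dominates the total.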
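4) If we train on only 10% of the nodes, the model has labels for only a small part of the graph, so it has limited supervision to learn from, and test accuracy is likely to suffer compared with training on more labeled nodes. In this transductive setup the full `edge_index` is still used for message passing, so the connections themselves remain available; only the labels are hidden. If the number of training iterations is high, the model might also overfit the small training set.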
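
A 10% / 90% split like the one in the question could be built as follows. This is a framework-free sketch using plain Python lists; `random_split_masks` and the seed are hypothetical, and 2708 is the number of nodes in Cora. In the notebook, the resulting masks would need to be converted to boolean tensors before replacing `data.train_mask` and `data.test_mask`.

```python
import random

def random_split_masks(num_nodes, train_fraction, seed=0):
    """Return boolean train/test masks with a random `train_fraction` of nodes in train."""
    rng = random.Random(seed)
    order = list(range(num_nodes))
    rng.shuffle(order)
    num_train = int(train_fraction * num_nodes)
    train_ids = set(order[:num_train])
    train_mask = [i in train_ids for i in range(num_nodes)]
    test_mask = [not m for m in train_mask]  # every remaining node is a test node
    return train_mask, test_mask

train_mask, test_mask = random_split_masks(num_nodes=2708, train_fraction=0.10)
print(sum(train_mask), "training nodes,", sum(test_mask), "test nodes")
```

Only the masks change here; the edge list is untouched, so message passing still runs over the whole graph.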
5) RMSprop converged more slowly than Adam, and the accuracy was lower as well.


## EC ##
1) Since dropout zeroes random elements during training, it regularizes the model by injecting noise. This reduces overfitting and improves generalization. However, if the dropout rate is too high, the model might underfit the data.
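
The mechanism can be sketched in a few lines. This is a minimal illustration of inverted dropout (survivors are rescaled by 1/(1-p) during training, and nothing happens at evaluation time), written in plain Python rather than with the notebook's tensors; the function name, the feature vector, and the seed are hypothetical, and `p` is assumed to be below 1.

```python
import random

def dropout(features, p, training, rng):
    """Inverted dropout: zero each entry with probability p, rescale survivors by 1/(1-p)."""
    if not training or p == 0.0:
        return list(features)  # evaluation mode: features pass through unchanged
    keep = 1.0 - p
    return [x / keep if rng.random() >= p else 0.0 for x in features]

rng = random.Random(0)
x = [1.0] * 8
train_out = dropout(x, p=0.5, training=True, rng=rng)
eval_out = dropout(x, p=0.5, training=False, rng=rng)
print("train:", train_out)
print("eval: ", eval_out)
```

With a very high `p`, most entries are zeroed in every step, which is where the risk of underfitting comes from.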
2) Mean pooling would eliminate one of the most important properties of each node, i.e., the number of neighbors it has. Two nodes whose messages have the same average would be treated identically, even though one may have significantly more neighbors than the other.
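
A tiny example makes the difference visible. The message values and the `aggregate` helper are hypothetical; the sketch only compares the two pooling choices on scalar messages.

```python
def aggregate(messages, mode):
    """Combine incoming messages by summing or by averaging them."""
    total = sum(messages)
    return total if mode == "sum" else total / len(messages)

few_neighbors = [0.5, 0.5]    # node with 2 neighbors
many_neighbors = [0.5] * 20   # node with 20 neighbors sending the same message

print("mean:", aggregate(few_neighbors, "mean"), aggregate(many_neighbors, "mean"))
print("sum: ", aggregate(few_neighbors, "sum"), aggregate(many_neighbors, "sum"))
```

Under mean pooling both nodes end up with the same value, while the sum keeps the degree information.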
3) Feature extraction using Node2Vec before passing the data into the GCN should improve overall performance, since Node2Vec captures information from across the graph through random walks, while the GCN captures information in the local neighborhood of each node. This is similar to the research paper we read for course prep over the summer, where an LLM was used for feature extraction before the features were passed into a GCNSA model.