
### Objective: 

In this assignment, implement the Node2Vec algorithm, a random-walk-based embedding method, to learn node embeddings. Then train a classifier on the learned embeddings to predict node labels.

### Dataset: 

Cora dataset: The dataset consists of 2,708 nodes (scientific publications) with 5,429 edges (citations between publications). Each node has a feature vector of size 1,433, and there are 7 classes (research topics).

### Skeleton Code:

import torch
import torch.nn as nn
import torch.optim as optim
from torch_geometric.datasets import Planetoid
from torch_geometric.utils import to_networkx
from node2vec import Node2Vec  # Importing Node2Vec for the random walk

# Load the Cora dataset
dataset = Planetoid(root='data/Cora', name='Cora')

# Prepare data
data = dataset[0]

# Convert to a NetworkX graph for the random walks
G = to_networkx(data, to_undirected=True)

# Node2Vec configuration
node2vec = Node2Vec(G, dimensions=64, walk_length=30, num_walks=200, workers=2) 
model = node2vec.fit(window=10, min_count=1)

# Embeddings for each node
embeddings = model.wv  # gensim KeyedVectors, keyed by node id as a string

# Define a simple classifier
class Classifier(nn.Module):
    def __init__(self, input_dim, output_dim):
        super(Classifier, self).__init__()
        self.fc = nn.Linear(input_dim, output_dim)

    def forward(self, x):
        return self.fc(x)

# Initialize classifier and optimizer
classifier = Classifier(64, 7)
optimizer = optim.Adam(classifier.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()

# Stack the embeddings once, in node-id order (Node2Vec keys are strings)
x = torch.tensor(embeddings[[str(i) for i in range(data.num_nodes)]])

# Training loop
for epoch in range(100):
    classifier.train()
    optimizer.zero_grad()

    # Get node embeddings as input
    output = classifier(x)
    
    loss = criterion(output, data.y)
    loss.backward()
    optimizer.step()

    if epoch % 10 == 0:
        print(f'Epoch {epoch}, Loss: {loss.item()}')

print("Training complete!")


ModuleNotFoundError: No module named 'node2vec'

## Explanation:
Node2Vec generates node embeddings by simulating random walks on the graph. These walks capture structural properties of nodes.
The generated embeddings are then used to train a classifier for predicting node labels.
The Cora dataset is a benchmark graph where nodes are papers and edges are citations.
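
As a rough illustration of the random walks described above, the sketch below samples one second-order biased walk on a small adjacency dictionary. The function name `biased_walk`, the toy graph, and the explicit return parameter `p` and in-out parameter `q` are assumptions for illustration (the skeleton code never sets them); the `node2vec` package's own implementation is organized differently.

```python
import random

def biased_walk(adj, start, walk_length, p=1.0, q=1.0, rng=None):
    """Sample one node2vec-style walk on an undirected graph.

    adj maps each node to a list of its neighbors. p is the return
    parameter and q the in-out parameter; p = q = 1 gives a uniform walk.
    """
    rng = rng or random.Random()
    walk = [start]
    while len(walk) < walk_length:
        cur = walk[-1]
        neighbors = adj[cur]
        if not neighbors:  # dead end: stop the walk early
            break
        if len(walk) == 1:
            # The first step has no previous node, so choose uniformly
            walk.append(rng.choice(neighbors))
            continue
        prev = walk[-2]
        weights = []
        for nxt in neighbors:
            if nxt == prev:            # step back to the previous node
                weights.append(1.0 / p)
            elif nxt in adj[prev]:     # stay at distance 1 from prev
                weights.append(1.0)
            else:                      # move further away from prev
                weights.append(1.0 / q)
        walk.append(rng.choices(neighbors, weights=weights, k=1)[0])
    return walk

# Toy graph (hypothetical): a triangle 0-1-2 with a tail 2-3-4
toy_adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4], 4: [3]}
walk = biased_walk(toy_adj, start=0, walk_length=10, rng=random.Random(0))
print(walk)
```

A small `q` makes the walk more likely to move away from where it came from, while a small `p` makes it more likely to step back; each walk is then treated like a sentence when training the skip-gram model.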

## Questions (1 point each):
1. What would happen if we increased the number of walks (num_walks) per node? How might this affect the learned embeddings?
2. What would happen if we reduced the walk length (walk_length)? How would this influence the structural information captured by the embeddings?
3. What would happen if we used directed edges instead of undirected edges for the random walks?
4. What would happen if we added more features to the nodes (e.g., 2000-dimensional features instead of 1433)?
5. What would happen if we used a different dataset with more classes? Would the classifier performance change significantly?
6. What would happen if we used a larger embedding dimension (e.g., 128 instead of 64)? How would this affect the model’s performance and training time?



### Extra credit: 
1. What would happen if we increased the window size (window) for the skip-gram model? How would it affect the embedding quality?

## No points, just for you to think about
7. What would happen if we removed self-loops from the graph before training Node2Vec?

9. What would happen if we applied normalization to the node embeddings before feeding them to the classifier?

## Answers ##
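1) If we increase the number of walks, node2vec samples more walks starting from each node, so the skip-gram model has more data to train from. We still get one embedding per node, but the embeddings should be less noisy, which should theoretically increase the accuracy. However, if num_walks is too high, the extra walks mostly repeat the same neighborhoods, so training takes longer for little gain, and the model could memorize the entire graph and overfit the data.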
2) Reducing the walk length would give us the same number of embeddings with the same size (the size is set by dimensions), but each walk would visit fewer nodes, so a node would only co-occur with nodes that are close to it. This means the model would learn more local information and less global information. This could work well in tasks like recommender systems, while higher walk lengths would work for overall structure detection tasks such as graph classification.
3) With directed edges, the random walks could only follow each edge in its own direction, so a walk that reaches a node with no outgoing edges cannot continue. Directed edges would have better accuracy in tasks which care about directionality, such as emails, where a person A sending an email to B doesn't mean B is sending the same email to A. Undirected edges would be better for tasks which don't care about directionality, such as social networks where A being B's friend is the same as B being A's friend.
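4) In this pipeline nothing would change, because Node2Vec learns only from the graph structure and the classifier only sees the 64-dimensional embeddings, so the node features (data.x) are never used. If the features were also passed to the classifier, the accuracy should be higher with more features, unless all of them contain unnecessary information and don't help the model learn any more than it already did.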
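5) I think the classifier performance would decrease with an increase in the number of classes, since the classifier has more labels to tell apart from the same embeddings (its output size would also have to match the new number of classes). To get better accuracy, you would have to increase the number of training iterations so the model can learn about the entire data.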
6) Increasing the size of the embeddings means that each node is described by more values, so the model has more capacity to capture differences between nodes. If the graph has high node diversity, this would improve the accuracy, but since both the skip-gram model and the classifier have more parameters to learn, training would take longer.
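
As a rough check on how `num_walks`, `walk_length`, and `dimensions` enter the pipeline, the following sketch counts what each one changes. The helper `corpus_stats` is hypothetical, and it assumes every walk reaches its full length, which holds when no node is a dead end.

```python
def corpus_stats(num_nodes, num_walks, walk_length, dimensions):
    """Summarize the walk corpus and the resulting embedding table."""
    total_walks = num_nodes * num_walks          # num_walks walks start at every node
    total_tokens = total_walks * walk_length     # node visits across all walks
    return {
        "walks": total_walks,
        "tokens": total_tokens,
        "embedding_shape": (num_nodes, dimensions),  # one vector per node
    }

# Skeleton-code settings on Cora (2,708 nodes), then two variations
base = corpus_stats(2708, num_walks=200, walk_length=30, dimensions=64)
more_walks = corpus_stats(2708, num_walks=400, walk_length=30, dimensions=64)
short_walks = corpus_stats(2708, num_walks=200, walk_length=10, dimensions=64)

for name, stats in [("base", base), ("more walks", more_walks), ("short walks", short_walks)]:
    print(name, stats)
```

With the skeleton settings this gives 541,600 walks and 16,248,000 node visits. Doubling `num_walks` doubles both, and cutting `walk_length` to 10 cuts the visits to a third, while the embedding table stays at 2,708 x 64 in every case; only `dimensions` changes its width.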


## EC ##
1) I believe that the result would be similar to that of increasing the walk length. Since the window size decides how many neighbors encountered in each walk we look at, a higher window size means the model would be learning more global information, and a lower window size would help it learn more local information.
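
To see what the window size controls, the sketch below lists the (center, context) pairs that a skip-gram model could draw from a single walk. The function `skipgram_pairs` and the example walk are hypothetical, and the sketch ignores any sampling the real trainer applies, so it only shows the upper bound that `window` sets.

```python
def skipgram_pairs(walk, window):
    """Return (center, context) pairs within `window` steps on one walk."""
    pairs = []
    for i, center in enumerate(walk):
        lo = max(0, i - window)
        hi = min(len(walk), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a node is not its own context
                pairs.append((center, walk[j]))
    return pairs

walk = ["0", "1", "2", "3", "4", "5"]   # hypothetical walk of six node ids
small = skipgram_pairs(walk, window=1)
large = skipgram_pairs(walk, window=3)
print(len(small), len(large))  # prints: 10 24
```

With `window=1` node "0" is only paired with "1", while with `window=3` it is also paired with "2" and "3", which sit further along the walk. A window larger than the walk length adds nothing, since no walk contains nodes that far apart.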