### Semi-supervised learning with GCNs

This tutorial is constructed from
 - A [Google Colab tutorial](https://colab.research.google.com/drive/1h3-vJGRVloF5zStxL5I0rSy4ZUPNsjy8?usp=sharing#scrollTo=NgcpV4rjAWy-) on graph neural networks.
 - A [second Colab tutorial](https://colab.research.google.com/drive/14OvFnAXggxB8vM4e8vSURUp1TaKnovzX) on node classification using graph neural networks.
 - _Hands-On Graph Neural Networks Using Python_ by Maxime Labonne, chapter 6.
 - Lectures 6.1-6.4 from a [Stanford machine learning course](https://www.youtube.com/watch?v=MH4yvtgAR-4).
 - A variety of other sources.

I don't know the copyright status of the Colab code (none is listed on the pages), but you should check before you use it in anything beyond coursework.

---

In [None]:
import networkx as nx
import matplotlib.pyplot as plt

def visualize_graph(G, color):
    plt.figure(figsize=(7,7))
    plt.xticks([])
    plt.yticks([])
    nx.draw_networkx(G, pos=nx.spring_layout(G, seed=42), with_labels=False,
                     node_color=color, cmap="Set2")
    plt.axis('off')


def visualize_embedding(h, color, epoch=None, loss=None):
    plt.figure(figsize=(7,7))
    plt.xticks([])
    plt.yticks([])
    h = h.detach().cpu().numpy()
    plt.scatter(h[:, 0], h[:, 1], s=140, c=color, cmap="Set2")
    if epoch is not None and loss is not None:
        plt.xlabel(f'Epoch: {epoch}, Loss: {loss.item():.4f}', fontsize=16)
    

__Karate Club Network with Labeled Nodes__

This part of the tutorial uses code liberally from the [Google Colab tutorial](https://colab.research.google.com/drive/1h3-vJGRVloF5zStxL5I0rSy4ZUPNsjy8?usp=sharing#scrollTo=zF5bw3m9UrMy) on graph convolutional neural networks. I don't know the copyright status of this code (none is listed on the page), but you should check before you use it in anything beyond coursework. Note that this part of the tutorial doesn't do node classification, but we'll do that in the next part.

In [None]:
#############
## Cell 0  ##
#############

from torch_geometric.datasets import KarateClub
from torch_geometric.data import Data as PyGData

dataset = KarateClub()
data: PyGData = dataset[0]
print(data)

Each data object has x, edge_index, and y. Some have other information like the masks.  
- x contains _node features_. It has one row per node and one column per feature. The Cora dataset, which we'll explore later, has 2708 nodes, each with a feature vector with 1433 components. The karate graph has 34 nodes and 34 features, discussed in the next cell.
- y contains _node labels_. If our goal is to classify nodes, then it helps to have some of the nodes labeled.
- edge_index has _graph connectivity_. This is essentially the edge set of the graph, stored with two rows and one column per edge: each column is an ordered pair (source, target). Each edge is directed, which means that an undirected edge $\{u,v\}$ appears twice in the edge_index: once as $(u,v)$ and again as $(v,u)$.
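
To make the edge_index convention concrete, here is a small sketch that builds the same kind of array by hand with NumPy for a three-node path graph. The graph and the variable names are made up for illustration; PyTorch Geometric stores the same information as an integer tensor.

```python
import numpy as np

# Undirected edges of a path graph 0 - 1 - 2 (toy example, not the karate data).
undirected_edges = [(0, 1), (1, 2)]

# Store each undirected edge {u, v} as two directed edges (u, v) and (v, u).
directed_edges = undirected_edges + [(v, u) for (u, v) in undirected_edges]

# Shape is (2, number_of_directed_edges): row 0 holds sources, row 1 holds targets.
edge_index = np.array(directed_edges).T
print(edge_index.shape)
print(edge_index)
```

Two undirected edges become four columns, which is why the edge count reported for an undirected graph is twice what you might expect.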

Let's look at some more information from the dataset, specifically the number of node features and classes. The karate club dataset uses a "one-hot" encoding of node features. We'll discuss other kinds of node features when we get to the Cora dataset.

In [None]:
dataset = KarateClub()
print(f'Dataset: {dataset}:')
print('======================')
print(f'Number of features: {dataset.num_features}')
print(f'Number of classes: {dataset.num_classes}')

__Classes.__ There are four classes. Every node has a class label in data.y, but the karate club dataset from PyTorch Geometric marks only four nodes, one per class, for training (see the training mask below). This was done so that any work that follows can try to group the other nodes into one of those classes, resulting in four communities.



In [None]:
print(f"Correct node labels = \n\t{data.y}")



__Features.__ 34 features reminds us that this dataset is using one-hot encoding.
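
As a small illustration of what one-hot features imply, the sketch below uses NumPy with a made-up graph size and a made-up weight matrix `W` (neither comes from the karate dataset). It shows that multiplying one-hot features by a weight matrix just returns the weight matrix, so each row of the first layer's weights behaves like a learnable embedding for one node.

```python
import numpy as np

num_nodes = 5          # toy size; the karate graph would use 34
X = np.eye(num_nodes)  # one-hot features: row i is all zeros except a 1 in column i

rng = np.random.default_rng(0)
W = rng.normal(size=(num_nodes, 2))  # hypothetical weight matrix of a first layer

# Multiplying one-hot features by W simply returns W, so row i of W acts as a
# learnable 2-dimensional embedding for node i (before any neighbor averaging).
print(np.allclose(X @ W, W))
```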

Let's construct a GCN, but rather than creating a graph autoencoder we'll create a network that learns to label nodes. As with the autoencoder work from the previous tutorial, the workhorse will be the GCNConv class from pytorch geometric.

__Initialization.__ The network architecture is
 - First hidden layer with 34 inputs and 4 outputs.
 - Second hidden layer with 4 inputs and 4 outputs.
 - Third hidden layer with 4 inputs and 2 outputs.
 - Output layer, which takes the output from the third hidden layer and forces it to choose one of four classes.

Notice that the initialization defines weighting functions using existing classes from PyTorch and PyTorch Geometric. We just have to specify the dimensions of each layer.

__Forward.__ The forward function takes a graph as input and computes the output by applying the weighting and squashing. The feature matrix (x) and the graph (edge_index) are the inputs to the model. The function returns two things: an estimate of the class of each node (e.g., to which community it belongs) and the embedding, which is defined as the output of the last hidden layer after it was squashed.
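
To see what one "weight, then squash" step does, here is a minimal NumPy sketch of a single graph-convolution layer on a toy graph. It assumes the usual GCN propagation rule (add self-loops, normalize symmetrically by degree, multiply by a weight matrix) and leaves out the bias term and the sparse-matrix machinery that a real library layer uses. The function name `gcn_layer`, the toy graph, and the random weights are all made up for illustration.

```python
import numpy as np

def gcn_layer(A, H, W):
    """One graph-convolution step: normalize, aggregate, weight, squash."""
    A_hat = A + np.eye(A.shape[0])            # add self-loops
    deg = A_hat.sum(axis=1)                   # degrees including self-loops
    D_inv_sqrt = np.diag(deg ** -0.5)
    A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt  # symmetric normalization
    return np.tanh(A_norm @ H @ W)            # weight, then squash

# Toy graph: path 0 - 1 - 2 with one-hot features.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
X = np.eye(3)
rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 2))   # hypothetical weights: 3 inputs -> 2 outputs

H1 = gcn_layer(A, X, W1)
print(H1.shape)  # one 2-dimensional embedding per node
```

Stacking three such layers, as the model below does, means each node's final embedding depends on features up to three hops away.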

In [None]:
#############
## Cell 1  ##
#############
# Code from https://colab.research.google.com/drive/1h3-vJGRVloF5zStxL5I0rSy4ZUPNsjy8?usp=sharing#scrollTo=H_VTFHd0uFz6
import torch
from torch.nn import Linear
from torch_geometric.nn import GCNConv
 
class GCN(torch.nn.Module):
    def __init__(self):
        super().__init__()
        torch.manual_seed(1234)
        self.hidden_layer_1 = GCNConv(dataset.num_features, 4)
        self.hidden_layer_2 = GCNConv(4, 4)
        self.hidden_layer_3 = GCNConv(4, 2)
        self.output_layer = Linear(2, dataset.num_classes)

    def forward(self, x, edge_index):
        h = self.hidden_layer_1(x, edge_index)  # weight
        h = h.tanh()                            # squash
        h = self.hidden_layer_2(h, edge_index)  # weight
        h = h.tanh()                            # squash
        h = self.hidden_layer_3(h, edge_index)  # weight
        h = h.tanh()                            # squash. Final GNN embedding space.
        # Apply a final (linear) classifier.
        out = self.output_layer(h)
        return out, h

model = GCN()
print(model)

The initial weights for each layer and for the output are randomly chosen. The initialization above sets the random seed so that everyone who runs this code gets the same visualization. We are only interested in the two-dimensional embedding that is generated when we integrate features from the three-hop neighborhoods, so we'll ignore the other variable returned from the method. Note that the forward method is called by invoking the model.

In [None]:
#############
## Cell 2  ##
#############

_, embedding = model(data.x, data.edge_index)
print(f'Embedding shape: {list(embedding.shape)}')

visualize_embedding(embedding, color=data.y)

As the Colab tutorial points out, we get pretty good separation even with random weights. This is because there is a ton of structural information available in three-hop neighborhoods.
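
One way to get a feel for what "three-hop neighborhood" means is to count which nodes can influence a given node after one, two, and three layers. The sketch below does this with NumPy on a made-up five-node path graph (not the karate data) by taking powers of the adjacency matrix with self-loops added.

```python
import numpy as np

# Toy path graph 0 - 1 - 2 - 3 - 4 (not the karate data).
n = 5
A = np.zeros((n, n), dtype=int)
for u in range(n - 1):
    A[u, u + 1] = 1
    A[u + 1, u] = 1

A_self = A + np.eye(n, dtype=int)  # each layer also keeps the node's own features

for layers in (1, 2, 3):
    reach = np.linalg.matrix_power(A_self, layers) > 0
    # Nodes whose features can influence node 0 after this many layers.
    print(layers, np.flatnonzero(reach[0]).tolist())
```

Each extra layer extends the reach by one hop, so the three-layer network above mixes information from every node within three hops.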

---

We will now do some training. We already discussed how the data object holds features (x), edges (edge_index), and node labels (y), but it also contains something called _train_mask_. Let's remind ourselves of what is in the data object and then inspect the training mask.

In [None]:
###############
## Cell 2.5  ##
###############

print(data)
print(data.train_mask)

The training mask is just a vector of booleans, where _True_ means that we will use the data in the training process and _False_ means we won't.

We have the same pipeline as before:
 - use the model on some input data
 - compute the loss
 - derive the gradients
 - update the parameters based on the gradients

Look at the following code, and then we'll discuss the _criterion_ function.

In [None]:
#############
## Cell 3  ##
#############
model = GCN()
criterion = torch.nn.CrossEntropyLoss()  # Define loss criterion.
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)  # Define optimizer.

def train(data):
    optimizer.zero_grad()  # Clear gradients.
    out, h = model(data.x, data.edge_index)  # Perform a single forward pass.
    loss = criterion(out[data.train_mask], data.y[data.train_mask])  # Compute the loss solely based on the training nodes.
    loss.backward()  # Derive gradients.
    optimizer.step()  # Update parameters based on gradients.
    return loss, h


The first big difference between this code and the code from the previous tutorial is the criterion used to compute the loss function. _Cross entropy loss_ is a method that computes an error signal between the labeled classes of nodes and the labels given by the network. The only thing that I'll mention about this loss function is that we typically take the output of the network and push it through a softmax operator. We talked about the softmax operator when we discussed node2vec and deepwalk. The key idea is that it tries to assign a probability to each of the possible labels that could be given to the node given its input features.
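
For reference, here is one standard way to write the softmax and the resulting cross entropy loss. The symbols are my own notation, not taken from the Colab tutorial: $z_i$ is the vector of $C$ output scores for node $i$, $y_i$ is its true class, and $T$ is the set of nodes used in the loss.

$$
p_{ik} = \frac{e^{z_{ik}}}{\sum_{j=1}^{C} e^{z_{ij}}},
\qquad
\mathcal{L} = -\frac{1}{|T|}\sum_{i \in T} \log p_{i,y_i}
$$

The loss is small when the network assigns high probability $p_{i,y_i}$ to the correct class of each node in $T$. PyTorch's CrossEntropyLoss applies the softmax internally, which is why the model returns raw scores rather than probabilities.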

The other difference is that the loss function isn't computed using all of the data. Using "out[data.train_mask]" and "data.y[data.train_mask]" says that we'll only compute the loss function on a subset of the nodes. All nodes will be used to compute out and h, but the loss is based on just a subset of nodes. The labels for nodes not identified in the training mask are never used to train the network.
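
Here is a small NumPy sketch of the same idea with made-up scores, labels, and mask: the loss is computed only on the rows that the boolean mask selects. It is meant to mirror what the masked call to the criterion does, not to reproduce PyTorch's implementation.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Toy scores for 4 nodes and 3 classes (made-up numbers).
out = np.array([[2.0, 0.0, 0.0],
                [0.0, 1.0, 0.0],
                [0.0, 0.0, 3.0],
                [1.0, 1.0, 1.0]])
y = np.array([0, 1, 2, 0])
train_mask = np.array([True, False, True, False])

probs = softmax(out[train_mask])   # only rows 0 and 2 survive the mask
labels = y[train_mask]
# Average negative log-probability of the correct class over the masked nodes.
loss = -np.log(probs[np.arange(len(labels)), labels]).mean()
print(round(float(loss), 4))
```

Changing the labels of nodes 1 and 3 would not change the loss at all, which is exactly what it means for those labels to be unused during training.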

__Training__

Let's do some training steps and look at how the nodes shift around until we have clear clusters. We'll only visualize the embedding every 50 steps or so.

In [None]:
#############
## Cell 4  ##
#############
for epoch in range(401):
    loss, embedding = train(data)
    if epoch % 50 == 0:
        visualize_embedding(embedding, color=data.y, epoch=epoch, loss=loss)

The Colab tutorial celebrates the result, but the green and yellow nodes haven't clustered well. Let's talk about why this is still a neat result.

First, note that we only used labels from four of the 34 nodes in the network to train the network, but we tested on the classes of all the nodes. Despite the small number of nodes used in training, we still get a small loss function (see the label at the bottom of each plot) and we see that nodes with similar labels have some clustering.

Second, let's visually inspect how well the trained classifier performs. The output is an $n\times 4$ matrix that assigns a 4D score vector to each of the $n$ nodes. Let's run this through a softmax to give us some sense of the probability of each node belonging to each class.


In [None]:
#############
## Cell 5  ##
#############
import numpy as np
out, h = model(data.x, data.edge_index)
softmax = torch.nn.Softmax(dim=1)
class_probabilities = np.round(softmax(out).detach().numpy(),2)
print(class_probabilities)

For each node, let's find the index that has maximum probability and compare to the correct class.

In [None]:
#############
## Cell 6  ##
#############
true_class = data.y.detach().numpy()
predicted_class = np.argmax(class_probabilities, axis=1)  # most likely class per node
correct_count = int((predicted_class == true_class).sum())
print(f"Succeeded {100 * correct_count / data.num_nodes:.0f}% of the time")
for i in range(data.num_nodes):  # loop over nodes, not features
    print(f"Node {i}: estimated class is {predicted_class[i]} and actual class is {true_class[i]}")


That's pretty good!


---

### Applying to Les Miserables
Let's apply the same pattern from the karate graph to the Les Miserables graph. We'll import the graph from networkx, convert it to a torch_geometric instance, make sure the features are set using one-hot encoding, and see what the clusters are. Doing so will allow us to compare the GCN model to node2vec.

In [None]:
import networkx as nx
from torch_geometric.utils import from_networkx
from torch import Tensor
G = nx.les_miserables_graph()   # read les miserables network
data = from_networkx(G)      # convert to pytorch data structure
data.x = Tensor(np.eye(len(G)))  # set node features using one-hot encoding
print(f"les miserables data = {data}")


__Creating our own node labels__

Let's label a few nodes by finding communities using the Louvain method, finding the node within each community that has maximum degree, and labeling that node. This is the supervised part of semi-supervised learning; we use some method to label nodes, and then train a network to label other nodes using those labels.
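
The selection step can be sketched without any graph library. The example below uses a made-up edge list and a made-up partition dictionary standing in for what a community-detection method would return. For each community it picks the node with the most neighbors inside that community and then builds a boolean training mask from those picks.

```python
from collections import defaultdict

# Toy graph and a made-up partition (node -> community id), standing in for
# the dictionary that a community-detection method would return.
edges = [("a", "b"), ("a", "c"), ("c", "d"), ("d", "e"), ("d", "f")]
partition = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}

# Count, for every node, the neighbors that share its community.
internal_degree = defaultdict(int)
for u, v in edges:
    if partition[u] == partition[v]:
        internal_degree[u] += 1
        internal_degree[v] += 1

# Pick the node with the largest internal degree in each community.
representative = {}
for node, comm in partition.items():
    best = representative.get(comm)
    if best is None or internal_degree[node] > internal_degree[best]:
        representative[comm] = node

# Only the representatives are marked as labeled training nodes.
train_mask = [node in representative.values() for node in partition]
print(representative, train_mask)
```

Only one node per community ends up in the mask, so the network has to infer every other label from the graph structure.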

In [None]:
import community as community_louvain  # python-louvain; the alias avoids a clash with the loop variable
nodes_dict = community_louvain.best_partition(G)
labeled_nodes = set()
labeled_node_list = []
for community_id in sorted(set(nodes_dict.values())):
    nodes = [key for key, val in nodes_dict.items() if val == community_id]
    SG = G.subgraph(nodes).copy()
    plt.figure()
    nx.draw_networkx(SG, pos=nx.spring_layout(SG, seed=0), node_color='y', node_size=800, with_labels=True, alpha=0.8)
    # Find the node with the largest degree inside this community.
    labeled_node, max_degree = None, 0
    for node in nodes:
        if SG.degree(node) > max_degree:
            max_degree = SG.degree(node)
            labeled_node = node
    labeled_node_list.append(labeled_node)
    title = f"Subgraph of community centered on {labeled_node}"
    labeled_nodes.add(labeled_node)
    plt.title(title)
    


Set up the training mask and classes

In [None]:
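# Build the mask and labels in the node order of G.nodes, which is the order
# that from_networkx used when it numbered the nodes.
mask = [node in labeled_nodes for node in G.nodes]
# CrossEntropyLoss expects class indices; torch.long is the conventional dtype.
data.y = torch.tensor([nodes_dict[node] for node in G.nodes], dtype=torch.long)
data.train_mask = torch.tensor(mask, dtype=torch.bool)
print(data)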

Create the NN.

In [None]:
import torch
from torch.nn import Linear
from torch_geometric.nn import GCNConv


class GCN(torch.nn.Module):
    def __init__(self, hidden_channels_1, hidden_channels_2):
        super().__init__()
        torch.manual_seed(1234)
        self.hidden_layer_1 = GCNConv(len(G.nodes), hidden_channels_1)
        self.hidden_layer_2 = GCNConv(hidden_channels_1, hidden_channels_2)
        self.output_layer = Linear(hidden_channels_2, len(labeled_nodes))

    def forward(self, x, edge_index):
        h = self.hidden_layer_1(x, edge_index)  # weight
        h = h.relu()                            # squash
        h = self.hidden_layer_2(h, edge_index)  # weight
        h = h.relu()                            # squash
        
        # Apply a final (linear) classifier.
        out = self.output_layer(h)

        return out, h
model = GCN(16,8)
print(model)

Define the training function

In [None]:
import time
criterion = torch.nn.CrossEntropyLoss()  # Define loss criterion.
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)  # Define optimizer.

def train(data):
    optimizer.zero_grad()  # Clear gradients.
    out, h = model(data.x, data.edge_index)  # Perform a single forward pass.
    loss = criterion(out[data.train_mask], data.y[data.train_mask])  # Compute the loss solely based on the training nodes.
    loss.backward()  # Derive gradients.
    optimizer.step()  # Update parameters based on gradients.
    return loss, h

Create some visualization and clustering helper functions

In [None]:
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def visualize(h):
    z = TSNE(n_components=2).fit_transform(h.detach().cpu().numpy())
    colorlist = ['#e41a1c', '#377eb8', '#4daf4a', '#984ea3', '#ff7f00', '#ffff33', '#a65628']

    plt.figure(figsize=(10,10))
    plt.xticks([])
    plt.yticks([])

    for class_number in set(nodes_dict.values()):
        index_list = extract_nodes_by_class(data.y,class_number)
        plt.scatter(z[index_list, 0], z[index_list, 1], s=70, c=colorlist[class_number])

    _ = plt.legend(labeled_node_list,bbox_to_anchor=(1, 1), loc='upper left')
    plt.show()

def extract_nodes_by_class(data,class_number):
    index_list = []
    for index, _ in enumerate(data):
        if data[index] == class_number:
            index_list.append(index)
    return index_list

We'll now train the network and then display the true labels for each node in the embedding space.

In [None]:
for epoch in range(5001):
    loss, embedding = train(data)
visualize(embedding)

Let's add edges into this embedding to get a feel for how the pieces fit together.

In [None]:
colorlist = ['#e41a1c', '#377eb8', '#4daf4a', '#984ea3', '#ff7f00', '#ffff33', '#a65628']

z = TSNE(n_components=2).fit_transform(embedding.detach().cpu().numpy())
pos = nx.spring_layout(G)
#print(pos)
i=0
for node in pos.keys():
    #print(pos[node])
    pos[node] = [z[i,0], z[i,1]]
    i = i + 1
plt.figure(figsize = (10, 10))
labellist = list(nodes_dict.values())
for class_number in set(nodes_dict.values()):
    nodelist = [key for key, val in nodes_dict.items() if val == class_number]
    positions = {key: val for key, val in pos.items() if key in nodelist}
    node_sizes = [20 + 20*val for key, val in dict(G.degree).items() if key in nodelist]
    nx.draw_networkx_nodes(G, pos=positions, nodelist=nodelist,
                           node_size=node_sizes, node_color=colorlist[class_number],
                           label=labeled_node_list[class_number])

#for num, i in enumerate(zip(G.nodes, labellist)):
#    n, l = i[0], i[1]
#    nx.draw_networkx_nodes(G, pos, nodelist=n, node_size = 5, node_color = colorlist[num], label=l)
nx.draw_networkx_edges(G, pos, width = 0.25)
_ = plt.legend(bbox_to_anchor=(1, 1), loc='upper left')

I think that's the neatest visualization that we've seen so far of the Les Miserables graph.

---

### Cora: A More Sophisticated Example

In this example, we'll be using the Cora database. See [summary here](https://paperswithcode.com/dataset/cora) and [Automating the Construction of Internet Portals with Machine Learning](https://link.springer.com/article/10.1023/A:1009953814988), A. McCallum et al. 

We'll be using pytorch and pytorch-geometric since they have neat implementations of the neural network components that aren't core to what is emphasized in class. It might be useful to get familiar with the [Planetoid class and graphs](https://pytorch-geometric.readthedocs.io/en/latest/generated/torch_geometric.datasets.Planetoid.html). Label dictionary taken from [medium article](https://medium.com/mlearning-ai/ultimate-guide-to-graph-neural-networks-1-cora-dataset-37338c04fe6f).

In [None]:
# Some of the code below is copied from Labonne, pp. 69-70. Do not share.
# pip install torch
# pip install torch_geometric
from torch_geometric.datasets import Planetoid
from torch import Tensor
# Download the dataset
dataset = Planetoid(root=".",name="Cora")
# "Cora has only one graph we can store in a dedicated data variable", Labonne p. 69
data = dataset[0]
# Get familiar with the dataset
print(f'Dataset: {dataset}')
print('----------------')
print(f'Number of graphs: {len(dataset)}')
print(f'Number of nodes: {data.x.shape[0]}')
print(f'Number of features: {dataset.num_features}')
print(f'Number of classes: {dataset.num_classes}')

There is one graph with about 2,700 nodes. Each node is placed into one of seven classes. Each node also has a vector of features associated with it. The features are built using a technique called "bag of words". The basic building block of a bag-of-words model is a single list of words shared by all the nodes. Think of it like this. A network of computer science papers is created with one node per paper and a directed edge between papers if the first paper cites the second. Someone went through the titles, abstracts, and maybe some other parts of the papers and identified important words. This set of words is the "vocabulary" associated with the network. Organize the vocabulary into an ordered data type like a vector. We'll call this the _vocabulary vector_.

Each node in the network is then described by the words from the vocabulary that appear in its paper. Each node is assigned a vector/tensor of the same size as the vocabulary vector. We call the vector associated with the node a _feature vector_ (or _feature tensor_). If the word in the $j^{\rm th}$ element of the vocabulary vector is in the paper, then the $j^{\rm th}$ element of the feature vector is assigned a value of 1; otherwise it is assigned a 0. Let's inspect one of the feature vectors from the Cora dataset.
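
Here is a tiny sketch of the bag-of-words idea using plain Python. The vocabulary and the two "papers" are made up for illustration; the real Cora vocabulary has 1433 words and is not reproduced here.

```python
# A made-up vocabulary vector and two made-up "papers".
vocabulary = ["graph", "neural", "genetic", "bayesian", "reinforcement"]
papers = {
    "paper_0": "a neural approach to graph classification",
    "paper_1": "genetic search for bayesian network structure",
}

def bag_of_words(text, vocabulary):
    """Return a 0/1 feature vector: 1 if the vocabulary word occurs in the text."""
    words = set(text.lower().split())
    return [1 if word in words else 0 for word in vocabulary]

features = {name: bag_of_words(text, vocabulary) for name, text in papers.items()}
for name, vector in features.items():
    print(name, vector)
```

Every feature vector has the same length as the vocabulary, and most entries are zero because any one paper uses only a few of the vocabulary words.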

---

Let's inspect the data a bit more closely.

In [None]:
print(data)
# Gather some statistics about the graph.
print(f'Number of nodes: {data.num_nodes}')
print(f'Number of edges: {data.num_edges}')
print(f'Average node degree: {data.num_edges / data.num_nodes:.2f}')
print(f'Number of training nodes: {data.train_mask.sum()}')
print(f'Training node label rate: {int(data.train_mask.sum()) / data.num_nodes:.2f}')
print(f'Has isolated nodes: {data.has_isolated_nodes()}')
print(f'Has self-loops: {data.has_self_loops()}')
print(f'Is undirected: {data.is_undirected()}')

Each data object has x, edge_index, and y. Some have other information like the masks.  
- x contains _node features_. It has one row per node and one column per feature. Thus, the Cora dataset has 2708 nodes, each with a feature vector with 1433 components.
- y contains _node labels_. If our goal is to classify nodes, then it helps to have some of the nodes labeled.
- edge_index has _graph connectivity_. This is essentially the edge set of the graph, stored with two rows and one column per edge: each column is an ordered pair (source, target). Each edge is directed, which means that an undirected edge $\{u,v\}$ appears twice in the edge_index: once as $(u,v)$ and again as $(v,u)$.

---

Let's inspect the features more closely.

In [None]:
# The x member of the data object contains the nodes' features.
# The features are a "bag of words". Words were taken from descriptions of
# the indexed papers. A "1" indicates that the corresponding vocabulary word
# appears in that paper.
print(f'Features 5-25 of node 0: {data.x[0][5:25]}')
print(f'Features 200-255 of node 1: {data.x[1][200:255]}')
print(f'Features 5-25 of node 2: {data.x[2][5:25]}')
print("--------------------")

And let's inspect the labels more closely.

In [None]:
# The y member of the data class contains node classes.
label_dict = {
    0: "Theory",
    1: "Reinforcement_Learning",
    2: "Genetic_Algorithms",
    3: "Neural_Networks",
    4: "Probabilistic_Methods",
    5: "Case_Based",
    6: "Rule_Learning"}
classes: list[int] = data.y.tolist()
print(f'Class of node 0: {label_dict[classes[0]]}')
print(f'Class of node 1: {label_dict[classes[1]]}')
print(f'Class of node 3: {label_dict[classes[3]]}')
print("--------------------")


It's useful to visualize the graph to start to understand its structures. Using methods from [this article from Medium](https://medium.com/mlearning-ai/ultimate-guide-to-graph-neural-networks-1-cora-dataset-37338c04fe6f), we'll turn the data into a graph. 

In [None]:
import networkx as nx
from torch_geometric.utils import to_networkx
G = to_networkx(data, to_undirected=True)

The graph is not connected, so I want to look at just the largest component. Plotting the graph is slow in networkx, so I'm including three options for you:
 - save the data as a gephi file so you can play with Gephi visualizations
 - plot positions using the spring layout (about a minute on my mac)
 - plot positions using the neato layout (about 1.5 minutes on my mac)

In [None]:
from matplotlib import pyplot as plt
# pull out largest connected component
largest_cc = max(nx.connected_components(G), key=len)
S= G.subgraph(largest_cc).copy()
#nx.write_gexf(G, "cora.gexf")

node_color = []
nodelist = [[], [], [], [], [], [], []]
colorlist = ['#e41a1c', '#377eb8', '#4daf4a', '#984ea3', '#ff7f00', '#ffff33', '#a65628']
labels = data.y
for n, i in enumerate(labels):
    if n in S.nodes:
        node_color.append(colorlist[i])
        nodelist[i].append(n)
#pos = nx.spring_layout(S, seed = 42)
pos = nx.nx_pydot.graphviz_layout(S,prog="neato") # slower than spring, but shows structure better
plt.figure(figsize = (10, 10))
labellist = list(label_dict.values())
for num, i in enumerate(zip(nodelist, labellist)):
    n, l = i[0], i[1]
    nx.draw_networkx_nodes(S, pos, nodelist=n, node_size = 5, node_color = colorlist[num], label=l)
nx.draw_networkx_edges(S, pos, width = 0.25)
_ = plt.legend(bbox_to_anchor=(1, 1), loc='upper left')

The structure suggests that a common pattern for papers is that someone publishes an "important" paper that is then used in subsequent papers. This is pretty typical for work done in my lab; students publish a paper to get things rolling, and then improve on the ideas in the paper while citing the first.

---

Let's import the visualization function from the [second Colab tutorial](https://colab.research.google.com/drive/14OvFnAXggxB8vM4e8vSURUp1TaKnovzX#scrollTo=imGrKO5YH11-). Note that the visualization function compresses whatever it is given into two dimensions. It uses the same compression tool we used when we discussed deepwalk and node2vec, namely TSNE.

I've modified the visualization to use the label_dict dictionary and the colorlist defined above. This allows us to see a legend.

In [None]:
# Helper function for visualization.
%matplotlib inline
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def visualize(h, color):
    z = TSNE(n_components=2).fit_transform(h.detach().cpu().numpy())

    plt.figure(figsize=(10,10))
    plt.xticks([])
    plt.yticks([])

    for class_number in label_dict.keys():
        index_list = extract_nodes_by_class(data.y,class_number)
        # A single color is passed for each class, so no colormap is needed.
        plt.scatter(z[index_list, 0], z[index_list, 1], s=70, c=colorlist[class_number])

    #plt.scatter(z[:, 0], z[:, 1], s=70, c=color, cmap="Set2")
    _ = plt.legend(["Theory", "Reinforcement_Learning", "Genetic_Algorithms",
                    "Neural_Networks", "Probabilistic_Methods",
                    "Case_Based", "Rule_Learning"],bbox_to_anchor=(1, 1), loc='upper left')
    plt.show()

def extract_nodes_by_class(data,class_number):
    index_list = []
    for index, item in enumerate(data):
        if data[index] == class_number:
            index_list.append(index)
    return index_list
#print(extract_nodes_by_class(data.y,1))

The Colab tutorial has a neat aside where a traditional neural network is trained (a multilayer perceptron). The classification accuracy is lousy for two reasons. First, the data is not linearly separable (at least not easily so). Second, the MLP doesn't take advantage of the extra information available in the graph structures. _The information available in the graph structures is why graph machine learning can be very useful._

---

Let's follow the Colab tutorial and create a graph convolutional neural network with one hidden layer and one output layer. The hidden layer has dimension _hidden_channels_. This means that the low dimensional embedding has dimension equal to _hidden_channels_. The hidden layer uses a rectified linear unit, which effectively replaces the negative values produced by the matrix multiplications with zeros. Between the hidden layer and the output layer is a call to [F.dropout](https://pytorch.org/docs/stable/nn.functional.html#dropout-functions). My intuition about neural networks isn't strong enough to know why they do this, but I included it anyway.

The output layer just computes the matrix multiplications. As before, I've renamed the layers.
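
Whatever the reason for using it, the mechanics of dropout are easy to sketch. The NumPy function below is a made-up stand-in, not PyTorch's implementation. It assumes the common "inverted dropout" convention: during training each entry is zeroed with probability p and the survivors are scaled by 1/(1-p), while in evaluation mode the input passes through unchanged.

```python
import numpy as np

def dropout(x, p=0.5, training=True, rng=None):
    """Inverted dropout: zero entries with probability p, rescale the rest."""
    if not training or p == 0.0:
        return x                      # evaluation mode: pass values through
    if rng is None:
        rng = np.random.default_rng()
    keep = rng.random(x.shape) >= p   # True where the entry survives
    return x * keep / (1.0 - p)       # rescale so the expected value is unchanged

x = np.ones((2, 4))
print(dropout(x, p=0.5, training=True, rng=np.random.default_rng(0)))
print(dropout(x, p=0.5, training=False))
```

This is why the code below switches between model.train() and model.eval(): the same forward pass behaves differently in the two modes.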

In [None]:

import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv

class GCN(torch.nn.Module):
    def __init__(self, hidden_channels):
        super().__init__()
        torch.manual_seed(1234567)
        self.hidden = GCNConv(dataset.num_features, hidden_channels)
        self.output = GCNConv(hidden_channels, dataset.num_classes)

    def forward(self, x, edge_index):
        x = self.hidden(x, edge_index)
        x = x.relu()
        x = F.dropout(x, p=0.5, training=self.training)
        x = self.output(x, edge_index)
        return x

model = GCN(hidden_channels=16)
print(model)

The hidden layer integrates one-hop neighborhoods, and the output layer does the matrix multiplication one more time to get two-hop neighborhoods.

In [None]:
model = GCN(hidden_channels=16)
model.eval()

embedding = model(data.x, data.edge_index)
visualize(embedding, color=data.y)

There is not enough structure in the two-hop neighborhoods to separate the nodes into classes without training.

Let's train. Code is from the second Colab tutorial. It is very close to what we did with the karate network, but I'd like to highlight a few differences.
 - Computing the loss only uses a subset of the data.
 - Testing how well the training performs also uses a subset of the data, but a different subset.
 - The output layer gives us seven numbers, and we choose the best one via the _argmax_.
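
The last two points can be sketched with NumPy on made-up numbers (five nodes and three classes rather than Cora's seven). Predictions come from the argmax over each row of scores, and accuracy is measured only on the nodes selected by a test mask that does not overlap the training mask.

```python
import numpy as np

# Made-up output scores for 5 nodes and 3 classes.
out = np.array([[2.0, 0.1, 0.3],
                [0.2, 1.5, 0.1],
                [0.3, 0.2, 0.9],
                [1.1, 0.4, 0.2],
                [0.1, 0.3, 2.2]])
y = np.array([0, 1, 2, 1, 2])
train_mask = np.array([True, True, False, False, False])
test_mask = np.array([False, False, True, True, True])

pred = out.argmax(axis=1)  # index of the largest score per node
# Fraction of test nodes whose predicted class matches the true class.
test_acc = (pred[test_mask] == y[test_mask]).mean()
print(pred.tolist(), round(float(test_acc), 2))
```

Taking the argmax of the raw scores gives the same answer as taking the argmax of the softmax probabilities, since softmax preserves the ordering within each row.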

In [None]:
#from IPython.display import Javascript  # Restrict height of output cell.
#display(Javascript('''google.colab.output.setIframeHeight(0, true, {maxHeight: 300})'''))

model = GCN(hidden_channels=16)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=5e-4)
criterion = torch.nn.CrossEntropyLoss()

def train():
    model.train()
    optimizer.zero_grad()  # Clear gradients.
    out = model(data.x, data.edge_index)  # Perform a single forward pass.
    loss = criterion(out[data.train_mask], data.y[data.train_mask])  # Compute the loss solely based on the training nodes.
    loss.backward()  # Derive gradients.
    optimizer.step()  # Update parameters based on gradients.
    return loss

def test():
    model.eval()
    out = model(data.x, data.edge_index)
    pred = out.argmax(dim=1)  # Use the class with the highest score.
    test_correct = pred[data.test_mask] == data.y[data.test_mask]  # Check against ground-truth labels.
    test_acc = int(test_correct.sum()) / int(data.test_mask.sum())  # Derive ratio of correct predictions.
    return test_acc


for epoch in range(1, 101):
    loss = train()
    print(f'Epoch: {epoch:03d}, Loss: {loss:.4f}')

Good, the loss goes down. Let's plot and see whether we have found good clusters.

In [None]:
model.eval()

embedding = model(data.x, data.edge_index)
visualize(embedding, color=data.y)

The embedding did indeed find communities. Nodes with similar labels grouped together. In order to do this, we used supervised learning, which means that someone had _supervised_ the learning by _labeling the nodes_. We then used that labeling to find the communities. The good news is that whoever did the labeling only had to label 5% of the nodes. Technically, this is called _semi-supervised_ learning -- a few labels, a lot of features, training and test sets, and exploiting graph structures to enable learning. (Note that we actually knew all the labels; we just used a few of them to train. Without knowing the labels we wouldn't have been able to create the plot above.)

---

Before I leave this example, I just want to peek at what the graph looks like when we use the embedding above and add in the edges.

In [None]:
node_color = []
nodelist = [[], [], [], [], [], [], []]
colorlist = ['#e41a1c', '#377eb8', '#4daf4a', '#984ea3', '#ff7f00', '#ffff33', '#a65628']
labels = data.y
for n, i in enumerate(labels):
    if n in S.nodes:
        node_color.append(colorlist[i])
        nodelist[i].append(n)

pos = nx.spring_layout(S, seed = 42)
z = TSNE(n_components=2).fit_transform(embedding.detach().cpu().numpy())


In [None]:
print(pos[1])
for node in pos.keys():
    pos[node] = z[node,:]
print(pos[1])
#pos = nx.nx_pydot.graphviz_layout(S,prog="neato") # slower than spring, but shows structure better
plt.figure(figsize = (10, 10))
labellist = list(label_dict.values())
for num, i in enumerate(zip(nodelist, labellist)):
    n, l = i[0], i[1]
    nx.draw_networkx_nodes(S, pos, nodelist=n, node_size = 5, node_color = colorlist[num], label=l)
nx.draw_networkx_edges(S, pos, width = 0.05)
_ = plt.legend(bbox_to_anchor=(1, 1), loc='upper left')

#### Unsupervised learning with node similarity

Leskovec says that you can blend feature-based learning in graph convolutional neural networks with similarity metrics like those used in deepwalk and node2vec. I won't show an example of that here, but check out [minute 28 of his course](https://www.youtube.com/watch?v=MH4yvtgAR-4).

---
