## Second Challenge

The second challenge is to learn node embeddings for the CORA dataset in an **unsupervised** way, i.e., the node labels are not available at training time.

We measure the quality of the embedding matrix by training a simple softmax classifier on the learned node embeddings with the training labels, and then computing its accuracy on the test labels.  However, remember that the training, validation, and test labels are all **unavailable** during training; you MUST NOT use them.  Instead, please treat the evaluation routine as a black box, and run it only at test time.

In [1]:
##### DO NOT CHANGE THIS CELL
import torch
import torch.nn.functional as F
import numpy as np
import scipy.sparse as ssp
import dgl
import dgl.data
import dgl.nn.pytorch as dglnn
from collections import namedtuple

Args = namedtuple('Args', ['dataset'])
dataset = dgl.data.load_data(Args('cora'))

G = dgl.DGLGraph(dataset.graph)
X = torch.FloatTensor(dataset.features)

def evaluate(emb):
    """
    Evaluate the performance of the learned embedding.  The greater returned
    value the better.
    
    It trains a softmax regression model on the training set from the given
    embeddings, and return the accuracy on the test set.
    
    Parameters
    ----------
    emb : numpy.ndarray
        An N-by-M matrix where N is the number of nodes in CORA and M is
        the size of node embedding (can be of any value).
    """
    from sklearn.linear_model import LogisticRegressionCV
    global dataset
    C = LogisticRegressionCV(
        Cs=[1e-3, 1e-2, 1e-1, 1, 10, 100, 1000, 10000, 100000, 1e+6, 1e+7],
        multi_class='multinomial', solver='lbfgs', max_iter=10000)
    train_mask = (dataset.train_mask != 0)
    test_mask = (dataset.test_mask != 0)
    labels = dataset.labels
    C.fit(emb[train_mask], labels[train_mask])
    print('Best model found with C =', C.C_[0])
    return C.score(emb[test_mask], labels[test_mask])

We expect you to learn the node embeddings only from the given graph `G` and the node features `X`.  The following cell is a placeholder solution that simply uses the raw features as the embedding.  Please implement your model and report the resulting test accuracy when you are done.

In [2]:
embedding = X.numpy()
print('Baseline performance using raw features:', evaluate(X.numpy()))
print('Baseline performance using my embedding:', evaluate(embedding))



Best model found with C = 100000.0
Baseline performance using raw features: 0.578
Best model found with C = 100000.0
Baseline performance using my embedding: 0.578
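
As one possible label-free starting point (a sketch, not the intended solution), the raw features can be smoothed over the graph with a symmetrically normalized adjacency matrix, in the spirit of parameter-free feature propagation. The function name `propagate_features`, the `num_hops` parameter, and the toy graph below are all hypothetical; the sketch assumes the adjacency is available as a dense NumPy array and uses only NumPy so that it runs on its own.

```python
import numpy as np

def propagate_features(adj, feats, num_hops=2):
    """Smooth node features over the graph without using any labels.

    adj   : (N, N) array, nonzero where an edge exists (hypothetical input)
    feats : (N, M) array of node features
    """
    # Symmetrize and add self-loops so each node keeps part of its own signal.
    a = ((adj + adj.T) > 0).astype(np.float64)
    np.fill_diagonal(a, 1.0)
    # Symmetric normalization: D^{-1/2} A D^{-1/2}.
    deg = a.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    a_norm = a * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    # Repeatedly average each node's features with those of its neighbors.
    emb = feats.astype(np.float64)
    for _ in range(num_hops):
        emb = a_norm @ emb
    return emb

# Toy example: a 4-node path graph 0-1-2-3 with one-hot features.
toy_adj = np.array([[0, 1, 0, 0],
                    [1, 0, 1, 0],
                    [0, 1, 0, 1],
                    [0, 0, 1, 0]])
toy_feats = np.eye(4)
toy_emb = propagate_features(toy_adj, toy_feats, num_hops=2)
print(toy_emb.shape)
```

On CORA, the same function could in principle be given an adjacency matrix extracted from `G` (how to obtain it depends on the DGL version, so that step is left out here) together with `X.numpy()`, and the result passed to `evaluate`. Because no labels enter the computation, this stays within the rules of the challenge; a trained unsupervised model may well do better.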
