In [2]:
import networkx as nx 
import torch 
import torch_geometric as torch_geo 
from torch_geometric.datasets import Planetoid
from torch_geometric.transforms import NormalizeFeatures


### <b>Node Classification Tasks</b>


#### Approach each classification task below using the following methods:

* classify nodes using only node features via a fully connected network (FCN), with no graph information.
* classify nodes using both node features and ad-hoc graph variables for each node (e.g. degree, centrality measures, or the sum of the neighbors' word features).
    * note: this is the baseline against which methods such as GCN and node2vec (n2v) should be compared.
* classify nodes using only graph structure via node2vec (don’t use features for nodes).
* classify nodes using both graph structure and node features by concatenating node2vec embeddings and node features.
* classify nodes using GCN (using both node features and graph structure).
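
As a rough illustration of the "ad-hoc graph variables" method above, the sketch below appends each node's degree and the sum of its neighbors' features to its own feature vector. It is written in plain Python so it runs without any dependencies. The function name `adhoc_graph_features` and the toy graph are assumptions made for illustration, and the edge list is assumed to follow the same two-row layout as PyG's `edge_index`.

```python
from collections import defaultdict


def adhoc_graph_features(edge_index, x):
    """Append degree and summed neighbor features to each node's feature vector.

    edge_index: pair of lists [sources, targets], one entry per directed edge
                (an undirected edge is assumed to appear in both directions).
    x:          list of per-node feature lists, all of the same length.
    """
    num_nodes = len(x)
    num_feats = len(x[0])

    # Collect the set of neighbors pointing into each node.
    neighbors = defaultdict(set)
    for src, dst in zip(edge_index[0], edge_index[1]):
        neighbors[dst].add(src)

    augmented = []
    for node in range(num_nodes):
        nbrs = neighbors[node]
        degree = len(nbrs)
        # Element-wise sum of the neighbors' feature vectors.
        nbr_sum = [sum(x[n][j] for n in nbrs) for j in range(num_feats)]
        augmented.append(list(x[node]) + [degree] + nbr_sum)
    return augmented


# Toy graph: a path 0 - 1 - 2, each undirected edge stored in both directions.
toy_edge_index = [[0, 1, 1, 2], [1, 0, 2, 1]]
toy_x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_features = adhoc_graph_features(toy_edge_index, toy_x)
```

The augmented vectors can then be fed to the same FCN used for the feature-only method, which keeps the comparison between the two fair. On tensors, `torch_geometric.utils.degree` would likely be a faster way to obtain the degree column.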

#### Requirements
1. Explore different architectures and hyperparameters. 
2. Compare these models for each classification problem, using standard metrics (accuracy, precision, recall) and a thoughtful data analysis (e.g. are false positives more often high degree nodes?). 
    - For node2vec and GCN, plot the dense vector embeddings and try to interpret them (e.g. use t-SNE to project the embeddings to 2D, colored by label or node degree).
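
One possible way to organize the comparison in requirement 2 is sketched below: per-class precision and recall plus overall accuracy, followed by a simple check of whether misclassified nodes tend to have higher degree. The helper names (`per_class_metrics`, `mean_degree_by_outcome`) and the toy labels are hypothetical. In practice a library routine such as scikit-learn's `classification_report` could replace the first function.

```python
def per_class_metrics(y_true, y_pred):
    """Return overall accuracy and a dict of per-class precision and recall."""
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)

    report = {}
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        report[c] = {
            # Fall back to 0.0 when a class is never predicted or never occurs.
            "precision": tp / (tp + fp) if tp + fp else 0.0,
            "recall": tp / (tp + fn) if tp + fn else 0.0,
        }
    return accuracy, report


def mean_degree_by_outcome(y_true, y_pred, degrees):
    """Return (mean degree of correctly classified nodes, mean degree of errors)."""
    def mean(values):
        return sum(values) / len(values) if values else float("nan")

    correct = [d for t, p, d in zip(y_true, y_pred, degrees) if t == p]
    wrong = [d for t, p, d in zip(y_true, y_pred, degrees) if t != p]
    return mean(correct), mean(wrong)


# Toy example: six nodes, three classes, made-up degrees.
toy_true = [0, 0, 1, 1, 2, 2]
toy_pred = [0, 1, 1, 1, 2, 0]
toy_degrees = [3, 10, 2, 4, 1, 8]

toy_accuracy, toy_report = per_class_metrics(toy_true, toy_pred)
toy_mean_degrees = mean_degree_by_outcome(toy_true, toy_pred, toy_degrees)
```

Running the same two functions on every model's test predictions gives directly comparable tables, and the degree split is one starting point for the error analysis the requirement asks for.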

#### Side Note: Testing on different types of graphs
Your classification problems should always be tested in circumstances similar to how the algorithm will be used! Graph problems will either be:
1. Transductive, when the graph structure is fixed ahead of time, but some node labels remain unknown. 
    * e.g. given a fixed set of publications, can you classify them for an analysis?
2. Inductive, when the graph evolves/changes post-training.
    * e.g. a publisher has a service: when an author uploads their paper, the algorithm recommends a category/tag for labeling the paper.
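
The practical difference between the two settings shows up in what the model may see during training. In a transductive test, the full edge list is available and only the labels are hidden. To mimic an inductive test, one option is to train only on the subgraph induced by the training nodes, so that held-out nodes and their edges are unseen until evaluation. The sketch below shows that filtering step in plain Python. The function name `induced_training_edges` and the toy graph are assumptions for illustration.

```python
def induced_training_edges(edge_index, train_nodes):
    """Keep only edges whose two endpoints are both training nodes.

    edge_index:  pair of lists [sources, targets], one entry per directed edge.
    train_nodes: iterable of node ids visible during training.
    """
    visible = set(train_nodes)
    kept_src, kept_dst = [], []
    for src, dst in zip(edge_index[0], edge_index[1]):
        if src in visible and dst in visible:
            kept_src.append(src)
            kept_dst.append(dst)
    return [kept_src, kept_dst]


# Toy graph: a path 0 - 1 - 2 - 3, each undirected edge stored in both directions.
full_edge_index = [[0, 1, 1, 2, 2, 3], [1, 0, 2, 1, 3, 2]]

# Inductive-style setup: node 3 "arrives" after training, so its edges are hidden.
train_edge_index = induced_training_edges(full_edge_index, train_nodes=[0, 1, 2])
```

At evaluation time the model would then be applied to the full edge list, which includes the new node and its connections.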

#### Two Datasets
1. CORA: building off of last week's material 
    * Tip: start writing library functions automating portions for the next task.

2. Choose another dataset from SNAP for node classification 
    * Describe a real world situation where this node classification would be useful (is it inductive or transductive?).



In [3]:
# data loading function for CORA
def load_cora_torch(filepath="../data/raw/Planetoid"):
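    """Load the CORA citation graph via the Planetoid loader and print summary statistics."""
    # Planetoid returns a dataset object; NormalizeFeatures row-normalizes each node's features to sum to one.
    dataset = Planetoid(root=filepath, name='Cora', transform=NormalizeFeatures())
    # CORA holds a single graph, stored at index 0.
    data = dataset[0]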
    # print some dataset statistics 
    print(f'Loads Cora dataset, at root location: {filepath}')
    print(f'Number of nodes: {data.num_nodes}')
    print(f'Number of edges: {data.num_edges}')
    print(f'Average node degree: {data.num_edges / data.num_nodes:.2f}')
    print(f'Number of training nodes: {data.train_mask.sum()}')
    print(f'Training node label rate: {int(data.train_mask.sum()) / data.num_nodes:.2f}')
    print(f'Has isolated nodes: {data.has_isolated_nodes()}')
    print(f'Has self-loops: {data.has_self_loops()}')
    print(f'Is undirected: {data.is_undirected()}')
    return data

In [4]:
cora = load_cora_torch() # load the cora dataset

Downloading https://github.com/kimiyoung/planetoid/raw/master/data/ind.cora.x
Downloading https://github.com/kimiyoung/planetoid/raw/master/data/ind.cora.tx
Downloading https://github.com/kimiyoung/planetoid/raw/master/data/ind.cora.allx
Downloading https://github.com/kimiyoung/planetoid/raw/master/data/ind.cora.y
Downloading https://github.com/kimiyoung/planetoid/raw/master/data/ind.cora.ty
Downloading https://github.com/kimiyoung/planetoid/raw/master/data/ind.cora.ally
Downloading https://github.com/kimiyoung/planetoid/raw/master/data/ind.cora.graph
Downloading https://github.com/kimiyoung/planetoid/raw/master/data/ind.cora.test.index


Loads Cora dataset, at root location: ../data/raw/Planetoid
Number of nodes: 2708
Number of edges: 10556
Average node degree: 3.90
Number of training nodes: 140
Training node label rate: 0.05
Has isolated nodes: False
Has self-loops: False
Is undirected: True


Processing...
Done!


In [8]:
cora

Data(x=[2708, 1433], edge_index=[2, 10556], y=[2708], train_mask=[2708], val_mask=[2708], test_mask=[2708])
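
The `Data` object above exposes the node features (`x`), labels (`y`), and boolean masks needed for the feature-only method. A minimal sketch of that baseline is shown below. It uses small random stand-in tensors rather than `cora` so that it runs without downloading anything. The layer sizes, dropout rate, learning rate, and epoch count are arbitrary assumptions, not tuned values.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Stand-in tensors playing the roles of cora.x, cora.y and the masks (sizes are made up).
num_nodes, num_feats, num_classes = 60, 16, 3
x = torch.rand(num_nodes, num_feats)
y = torch.randint(0, num_classes, (num_nodes,))
train_mask = torch.zeros(num_nodes, dtype=torch.bool)
train_mask[:20] = True
test_mask = ~train_mask

# Feature-only classifier: edge_index is never used.
mlp = nn.Sequential(
    nn.Linear(num_feats, 32),
    nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(32, num_classes),
)
optimizer = torch.optim.Adam(mlp.parameters(), lr=0.01, weight_decay=5e-4)
loss_fn = nn.CrossEntropyLoss()

# Train on the labeled training nodes only.
mlp.train()
for epoch in range(50):
    optimizer.zero_grad()
    loss = loss_fn(mlp(x[train_mask]), y[train_mask])
    loss.backward()
    optimizer.step()

# Evaluate on the held-out test nodes.
mlp.eval()
with torch.no_grad():
    test_pred = mlp(x[test_mask]).argmax(dim=1)
test_acc = (test_pred == y[test_mask]).float().mean().item()
```

With the real dataset, the stand-in tensors would be replaced by `cora.x`, `cora.y`, `cora.train_mask`, and `cora.test_mask`. Because the labels here are random, the resulting accuracy is not meaningful.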