# Self study 9

In this self study we start to investigate node classification. We use the standard bibliographic dataset 'Cora', described at https://relational.fit.cvut.cz/dataset/CORA. This is still a rather small network, but a bit more substantial than, for example, the Lazega lawyers network. It is a standard benchmark for node classification techniques.

In [1]:
import numpy as np
import networkx as nx
import pandas as pd

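We read the Cora data from two files: cora.cites contains the citation edges, and cora.content contains the node attributes. It turns out to be convenient to read the node attribute data into a Pandas dataframe first and to convert it to a numpy array afterwards: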

In [2]:
coragraph=nx.readwrite.edgelist.read_edgelist("cora.cites")

In [3]:
coraatts_pd=pd.read_csv("cora.content",delimiter="\t",header=None)

display(coraatts_pd)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,1425,1426,1427,1428,1429,1430,1431,1432,1433,1434
0,31336,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,Neural_Networks
1,1061127,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,Rule_Learning
2,1106406,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Reinforcement_Learning
3,13195,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Reinforcement_Learning
4,37879,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Probabilistic_Methods
5,1126012,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,Probabilistic_Methods
6,1107140,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Theory
7,1102850,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Neural_Networks
8,31349,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Neural_Networks
9,1106418,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Theory


The classification problem here is always to predict the subject area of a paper, which is given in the last column (column 1434). Column 0 holds the paper id.

We also need the data as a numpy array:

In [5]:
coraatts_arr=np.array(coraatts_pd)

print(coraatts_arr)

[[31336 0 0 ... 0 0 'Neural_Networks']
 [1061127 0 0 ... 0 0 'Rule_Learning']
 [1106406 0 0 ... 0 0 'Reinforcement_Learning']
 ...
 [1128978 0 0 ... 0 0 'Genetic_Algorithms']
 [117328 0 0 ... 0 0 'Case_Based']
 [24043 0 0 ... 0 0 'Neural_Networks']]


A problem now is that the order of the rows in coraatts_arr does not correspond to the order in which nodes are enumerated by coragraph.nodes. The following fixes this problem by rebuilding the array row by row in the node order of the graph:

In [6]:
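# Node labels in coragraph are strings, while the ids in column 0 are integers.
rows=[]
for n in coragraph.nodes:
    # Look up the attribute row whose paper id matches node n.
    rows.append(coraatts_arr[np.where(coraatts_arr[:,0]==int(n))[0],:])
coraatts_arr=np.vstack(rows)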

print(coraatts_arr)

[[35 0 0 ... 0 0 'Genetic_Algorithms']
 [1033 0 0 ... 0 0 'Genetic_Algorithms']
 [103482 0 0 ... 0 0 'Neural_Networks']
 ...
 [853155 0 0 ... 0 0 'Neural_Networks']
 [853115 0 0 ... 0 0 'Neural_Networks']
 [853118 0 0 ... 0 0 'Neural_Networks']]
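
For the tasks below it helps to separate the reordered array into paper ids, a feature matrix, and class labels. The following is a minimal sketch, assuming the column layout seen in the printout (paper id first, subject area last, numeric attributes in between). The helper `split_columns`, the names `ids`, `X`, `y`, and the toy array are illustrative and not part of the original notebook:

```python
import numpy as np

def split_columns(atts_arr):
    """Split an attribute array into (ids, X, y).

    Assumes column 0 holds the paper id, the last column holds the
    class label, and everything in between is a numeric feature.
    """
    ids = atts_arr[:, 0].astype(int)
    X = atts_arr[:, 1:-1].astype(float)
    y = atts_arr[:, -1].astype(str)
    return ids, X, y

# Toy stand-in for coraatts_arr (object dtype, like np.array(coraatts_pd)).
toy_arr = np.array(
    [[35, 0, 1, 0, "Genetic_Algorithms"],
     [1033, 1, 0, 0, "Genetic_Algorithms"],
     [103482, 0, 0, 1, "Neural_Networks"]],
    dtype=object,
)
ids, X, y = split_columns(toy_arr)
print(ids, X.shape, sorted(set(y)))
```

On the real data the call would be `ids, X, y = split_columns(coraatts_arr)`, after which row i of `X` and entry i of `y` belong to the i-th node in coragraph.nodes.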


**Task 1:** Recreate the experiments that are shown in the 'Independent_Classification' notebook. Which is more effective: classification based on the attributes contained in coraatts_arr, or classification based on coefficients in the singular value decomposition?
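
One possible reading of "coefficients in the singular value decomposition" is to decompose the adjacency matrix of the graph and use the leading left singular vectors, scaled by their singular values, as node features. The sketch below illustrates this on a toy graph; the function `svd_node_features`, the choice of the adjacency matrix, and the value of `k` are assumptions, and the referenced notebook may set things up differently:

```python
import numpy as np

def svd_node_features(A, k):
    """Return an (n, k) matrix of SVD coefficients for the rows of A."""
    # np.linalg.svd returns singular values in descending order.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] * s[:k]

# Toy undirected graph: two triangles joined by the edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
n = 6
A = np.zeros((n, n))
for u, v in edges:
    A[u, v] = A[v, u] = 1.0

Z = svd_node_features(A, k=2)
print(Z.shape)  # (6, 2)
```

For the Cora graph, `nx.to_numpy_array(coragraph)` yields an adjacency matrix whose rows follow the order of coragraph.nodes, so the resulting features line up with the rows of the reordered coraatts_arr.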

**Task 2:** Try some other approaches:
<ul>
    <li>Simple majority vote of the graph neighbors: predict the class label of a <i>test</i> node as the most frequent class label among the test node's graph neighbors. Only neighbors belonging to the <i>training</i> set may be used here!</li>
    <li>Think of other node features you can construct, such as node degree, PageRank, etc. (networkx provides functions to compute such things). Does any of this increase your prediction accuracy?</li>
    </ul>
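
The majority vote can be sketched as follows. This is a minimal illustration, not a reference solution: the function name `majority_vote_predict`, the `default` fallback for test nodes without any training neighbor, and the toy graph are all assumptions. The graph is passed as a mapping from node to neighbors, so a plain dict works as well as `coragraph.adj`:

```python
from collections import Counter

def majority_vote_predict(adj, train_labels, test_nodes, default=None):
    """Predict a label for each test node from its labeled training neighbors.

    adj          : mapping node -> iterable of neighbors (e.g. coragraph.adj)
    train_labels : dict node -> class label, training nodes only
    test_nodes   : iterable of nodes to predict
    default      : label used when a test node has no training neighbor
    """
    predictions = {}
    for node in test_nodes:
        # Only neighbors with a known training label get a vote.
        votes = Counter(train_labels[nb] for nb in adj[node] if nb in train_labels)
        predictions[node] = votes.most_common(1)[0][0] if votes else default
    return predictions

# Toy graph as a plain adjacency dict.
toy_adj = {
    "a": ["b", "c", "d"],
    "b": ["a"],
    "c": ["a"],
    "d": ["a", "e"],
    "e": ["d"],
}
toy_train = {"b": "Theory", "c": "Theory", "d": "Neural_Networks"}
preds = majority_vote_predict(toy_adj, toy_train, ["a", "e"], default="Theory")
print(preds)
```

Ties are broken in favor of the label encountered first among the neighbors. A reasonable choice for `default` is the most frequent class in the training set.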

**Task 3:** The sklearn method train_test_split that was used in the Independent Classification notebook to split the data into a training and a test set performs a purely random split. This is not always representative of how labeled and unlabeled nodes are distributed over a network in reality. Create an alternative split by selecting test nodes as follows:
<ul>
    <li>randomly select a small number of nodes (e.g. 3, 5, 10, ...) as "seed" test nodes</li>
    <li>add all direct neighbors of nodes in the test set to the test set</li>
    <li>repeat the previous step until the test set has reached a size of about 20% of the total number of nodes</li>
    </ul>
    
  Now redo the experiments with this train/test split. Are the results better or worse than what you obtained before with completely random splits?
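
The split procedure described above could look roughly like this. The function `snowball_test_split` and its parameters are hypothetical names. Two design choices go beyond the task description and are assumptions: growth stops in the middle of a neighbor layer once the target size is reached, and a fresh random seed is drawn if the test set cannot grow any further:

```python
import random

def snowball_test_split(adj, n_seeds=5, test_fraction=0.2, rng=None):
    """Grow a test set outward from random seed nodes.

    adj : mapping node -> iterable of neighbors (e.g. coragraph.adj)
    Returns (train_nodes, test_nodes) as two lists in the order of adj.
    """
    rng = rng or random.Random()
    nodes = list(adj)
    target = int(round(test_fraction * len(nodes)))
    test = set(rng.sample(nodes, n_seeds))
    frontier = list(test)
    while len(test) < target:
        if not frontier:
            # No unused neighbors left: restart from a fresh random seed.
            remaining = [n for n in nodes if n not in test]
            frontier = [rng.choice(remaining)]
            test.add(frontier[0])
            continue
        next_frontier = []
        for node in frontier:
            for nb in adj[node]:
                if nb not in test and len(test) < target:
                    test.add(nb)
                    next_frontier.append(nb)
        frontier = next_frontier
    train = [n for n in nodes if n not in test]
    return train, [n for n in nodes if n in test]

# Toy graph: a ring of 50 nodes.
toy_ring = {i: [(i - 1) % 50, (i + 1) % 50] for i in range(50)}
train, test = snowball_test_split(toy_ring, n_seeds=2, rng=random.Random(0))
print(len(train), len(test))  # 40 10
```

Because both lists follow the node order of `adj`, they can be turned into row indices for the reordered coraatts_arr when called with `coragraph.adj`.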

**Task 4:** Implement the label propagation algorithm (either the iterative or the random walk version). Evaluate and compare its accuracy on the two different train/test split constructions.
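
As a starting point, here is a minimal sketch of one common formulation of the iterative version: every unlabeled node repeatedly takes the average of its neighbors' class distributions, while training nodes keep their known label. The function name `label_propagation`, the fixed iteration count, and the synchronous update are assumptions; the variant intended in the course may differ in its details:

```python
def label_propagation(adj, train_labels, n_iter=50):
    """Iterative label propagation with clamped training labels.

    adj          : mapping node -> iterable of neighbors (e.g. coragraph.adj)
    train_labels : dict node -> class label for the labeled (training) nodes
    Returns a dict node -> predicted label (None if no label mass arrived).
    """
    classes = sorted(set(train_labels.values()))

    def one_hot(label):
        return {c: 1.0 if c == label else 0.0 for c in classes}

    # Labeled nodes start one-hot; unlabeled nodes start with zero mass.
    dist = {n: one_hot(train_labels.get(n)) for n in adj}
    for _ in range(n_iter):
        new_dist = {}
        for node in adj:
            nbs = list(adj[node])
            if node in train_labels or not nbs:
                new_dist[node] = dist[node]  # clamp labels, skip isolated nodes
                continue
            new_dist[node] = {
                c: sum(dist[nb][c] for nb in nbs) / len(nbs) for c in classes
            }
        dist = new_dist
    return {
        n: max(classes, key=lambda c: d[c]) if any(d.values()) else None
        for n, d in dist.items()
    }

# Toy graph: a path a-b-c-d with labeled end points, plus an isolated node z.
toy_path = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"], "z": []}
toy_known = {"a": "Theory", "d": "Neural_Networks"}
labels = label_propagation(toy_path, toy_known)
print(labels)
```

In the toy example, node b ends up closer to the label of a and node c closer to the label of d, while the isolated node z receives no label. For evaluation, `train_labels` would contain only the training nodes of the respective split.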