# Self study 9

In this self study we are starting to investigate node classification. We are using a standard bibliographic dataset 'Cora', described here: https://relational.fit.cvut.cz/dataset/CORA This is still a rather small network, but a bit more serious than the Lazega lawyers, for example. It is a standard benchmark for node classification techniques. 

In [1]:
import numpy as np
import networkx as nx
import pandas as pd

We read the Cora data from two files. It turns out to be convenient to read the node attribute data first into a Pandas dataframe:

In [2]:
coragraph=nx.readwrite.edgelist.read_edgelist("cora.cites")

In [3]:
coraatts_pd=pd.read_csv("cora.content",delimiter="\t",header=None)

coraatts_pd

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,1425,1426,1427,1428,1429,1430,1431,1432,1433,1434
0,31336,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,Neural_Networks
1,1061127,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,Rule_Learning
2,1106406,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Reinforcement_Learning
3,13195,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Reinforcement_Learning
4,37879,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Probabilistic_Methods
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2703,1128975,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Genetic_Algorithms
2704,1128977,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Genetic_Algorithms
2705,1128978,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Genetic_Algorithms
2706,117328,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Case_Based


The classification problem here always is to predict the subject area of a paper.

We also need the data as a numpy array:

In [4]:
coraatts_arr=np.array(coraatts_pd)

coraatts_arr

array([[31336, 0, 0, ..., 0, 0, 'Neural_Networks'],
       [1061127, 0, 0, ..., 0, 0, 'Rule_Learning'],
       [1106406, 0, 0, ..., 0, 0, 'Reinforcement_Learning'],
       ...,
       [1128978, 0, 0, ..., 0, 0, 'Genetic_Algorithms'],
       [117328, 0, 0, ..., 0, 0, 'Case_Based'],
       [24043, 0, 0, ..., 0, 0, 'Neural_Networks']], dtype=object)

A problem now is that the order of nodes in coraatts does not correspond to the order in which nodes are enumerated by coragraph.nodes. The following fixes this problem:

In [5]:
rows=[]
for n in coragraph.nodes:
    rows.append(coraatts_arr[np.where(coraatts_arr[:,0]==int(n))[0],:])
coraatts_arr=np.vstack(rows)   

coraatts_arr

array([[35, 0, 0, ..., 0, 0, 'Genetic_Algorithms'],
       [1033, 0, 0, ..., 0, 0, 'Genetic_Algorithms'],
       [103482, 0, 0, ..., 0, 0, 'Neural_Networks'],
       ...,
       [853155, 0, 0, ..., 0, 0, 'Neural_Networks'],
       [853115, 0, 0, ..., 0, 0, 'Neural_Networks'],
       [853118, 0, 0, ..., 0, 0, 'Neural_Networks']], dtype=object)

**Task 1:** Recreate the experiments that are shown in the 'Independent_Classification' notebook. What is more effective, classification based on the attributes contained in coraats_arr, or classification based on coefficients in the singular value decomposition?

In [6]:
# Extract labels and feature matrixes
Y = coraatts_arr[::,1434]
features = coraatts_arr[::, 1:1434]


In [7]:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier,plot_tree
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (confusion_matrix,accuracy_score)
# Split data
feat_train, feat_test, Y_train, Y_test = train_test_split(features, Y, train_size=0.8)

# Learn a decision tree classifier
dtree = DecisionTreeClassifier(min_samples_split=32, max_depth=128)

dtree.fit(feat_train, Y_train)

print("Accuracy on training data: {}".format(dtree.score(feat_train, Y_train)))

# Predict and evaluate model
Y_pred = dtree.predict(feat_test)

print("Confusion matrix: \n{}".format(confusion_matrix(Y_test,Y_pred)))
print("Accuracy: {}".format(accuracy_score(Y_test,Y_pred)))

Accuracy on training data: 0.8457987072945522
Confusion matrix: 
[[32  0  1  3  2  1  5]
 [ 4 86 12  2  2  1  0]
 [14  9 92  8  4  1  8]
 [ 2  1 22 69  1  2  4]
 [ 1  2  8  3 28  0  4]
 [ 5  0  4  1  1 21  6]
 [10  0  9  4  5  6 36]]
Accuracy: 0.6715867158671587


In [8]:
# Learn a logistic regression model
#learn:
lr=LogisticRegression(solver="lbfgs",class_weight="balanced")
lr.fit(feat_train,Y_train)

#test:
Y_pred=lr.predict(feat_test)

#evaluate:
print("Test data confusion matrix: \n {}".format(confusion_matrix(Y_test,Y_pred)))
print("Accuracy: {}".format(accuracy_score(Y_test,Y_pred)))

Test data confusion matrix: 
 [[ 35   0   1   2   1   2   3]
 [  1  88   8   3   2   3   2]
 [ 10   6 102   6   2   2   8]
 [  4   0  11  76   0   4   6]
 [  0   3   5   1  36   0   1]
 [  6   0   1   0   1  27   3]
 [  5   0   5   4   2   6  48]]
Accuracy: 0.7601476014760148


In [39]:
# Use adjacency matrix SVD
A=nx.linalg.graphmatrix.adjacency_matrix(coragraph).todense()
svdA_features=np.linalg.svd(A)[0][:,:64]
svdA_features=np.asarray(svdA_features)

feat_train, feat_test, Y_train, Y_test  = train_test_split(svdA_features,Y , train_size=0.8)


In [44]:
# Learn a decision tree classifier
dtree = DecisionTreeClassifier(max_depth=32)

dtree.fit(feat_train, Y_train)

print("Accuracy on training data: {}".format(dtree.score(feat_train, Y_train)))

# Predict and evaluate model
Y_pred = dtree.predict(feat_test)

print("Confusion matrix: \n{}".format(confusion_matrix(Y_test,Y_pred)))
print("Accuracy: {}".format(accuracy_score(Y_test,Y_pred)))


Accuracy on training data: 0.948291782086796
Confusion matrix: 
[[ 38   0   9   3   1   4   6]
 [  2  58  11   0   3   1   4]
 [  2   7 116  11   1   1  13]
 [  2   2  19  58   3   0   3]
 [  4   3   5   0  19   0   6]
 [  5   2  10   0   0  15   4]
 [  4   2  18   9   2   5  51]]
Accuracy: 0.6549815498154982


In [45]:
# Learn a logistic regression model
#learn:
lr=LogisticRegression(solver="lbfgs",class_weight="balanced")
lr.fit(feat_train,Y_train)

#test:
Y_pred=lr.predict(feat_test)

#evaluate:
print("Test data confusion matrix: \n {}".format(confusion_matrix(Y_test,Y_pred)))
print("Accuracy: {}".format(accuracy_score(Y_test,Y_pred)))

Test data confusion matrix: 
 [[ 45   0  11   0   0   5   0]
 [  0  54  23   0   2   0   0]
 [  2   2 129   7   2   1   8]
 [  2   0  24  54   2   0   5]
 [  4   1  15   1  16   0   0]
 [  1   0  13   0   0  22   0]
 [  4   0  23   0   0   6  58]]
Accuracy: 0.6974169741697417


In [46]:
# Use lazenga SVD
L=nx.linalg.laplacianmatrix.laplacian_matrix(coragraph).todense()
svdL_features=np.linalg.svd(L)[0][:,len(L)-128:]
svdL_features=np.asarray(svdL_features)

feat_train, feat_test, Y_train, Y_test = train_test_split(svdL_features, Y, train_size=0.8)

In [51]:
# Learn a decision tree classifier
dtree = DecisionTreeClassifier(max_depth=32)

dtree.fit(feat_train, Y_train)

print("Accuracy on training data: {}".format(dtree.score(feat_train, Y_train)))

# Predict and evaluate model
Y_pred = dtree.predict(feat_test)

print("Confusion matrix: \n{}".format(confusion_matrix(Y_test,Y_pred)))
print("Accuracy: {}".format(accuracy_score(Y_test,Y_pred)))

Accuracy on training data: 0.9907663896583564
Confusion matrix: 
[[ 55   0   4   2   0   6   3]
 [  2  71   4   0   7   0   1]
 [  3   1 143   4   8   1   6]
 [  0   0   9  74   0   0   1]
 [  1   0   2   0  35   0   1]
 [  3   0   1   0   0  27   1]
 [  5   0  10   7   1   3  40]]
Accuracy: 0.8210332103321033


In [52]:
# Learn a logistic regression model
#learn:
lr=LogisticRegression(solver="lbfgs",class_weight="balanced")
lr.fit(feat_train,Y_train)

#test:
Y_pred=lr.predict(feat_test)

#evaluate:
print("Test data confusion matrix: \n {}".format(confusion_matrix(Y_test,Y_pred)))
print("Accuracy: {}".format(accuracy_score(Y_test,Y_pred)))

Test data confusion matrix: 
 [[ 50   1   0   1   1  13   4]
 [  0  85   0   0   0   0   0]
 [  0   0 133   3  17   6   7]
 [  2   0  11  61   3   6   1]
 [  0   0   0   0  39   0   0]
 [  0   0   0   0   0  31   1]
 [  1   1   5   2   4  15  38]]
Accuracy: 0.8062730627306273


**Task 2:** Try some other approaches:
<ul>
    <li> Simple majority vote of the graph neighbors: predict the class label of a <i>test</i> node according to the majority of the class labels among the test node's graph neighbors. Here only neighbors belonging to the <i>training</i> set can be used!  </li>
    <li> Think of other node features you can construct, such as node degree, pagerank , etc. (networkx provides functions to compute such things). Does any of this increase your prediction accuracy?</li>
    </ul>

In [69]:
# Majority vote
def majority_vote(node):
    if node in coragraph.nodes() and node not in feat_train:
        result = {}
        # Get node neighbours
        for e in nx.edges(coragraph, nbunch=(node)):
            m = e[1]
            if m in feat_train:
                idx = feat_train.index(m)
                l = Y_train[idx]
                if l in result.keys():
                    result[l] += 1
                else:
                    result[l] = 1
        # Select most common label of neighbours
        r = list(result.keys())
        r.sort(key=lambda x: result[x], reverse=True)
        if len(r) > 0:
            return r[0]
        else:
            # if node had no neighbours, return most common label
            labels = {}
            for l in Y_train:
                if l in labels.keys():
                    labels[l] += 1
                else:
                    labels[l] = 1
            ls = list(labels.keys())
            ls.sort(key=lambda x: labels[x], reverse=True)
            most_common_label = ls[0]
            return most_common_label
    else:
        return False


In [74]:
feat_train, feat_test, Y_train, Y_test = train_test_split([str(x) for x in coraatts_arr[::,0]], Y, train_size=0.8)


In [75]:
# train
Y_pred = [majority_vote(x) for x in feat_test]
#evaluate:
print("Test data confusion matrix: \n {}".format(confusion_matrix(Y_test,Y_pred)))
print("Accuracy: {}".format(accuracy_score(Y_test,Y_pred)))

Test data confusion matrix: 
 [[ 34   2   1   0   1   5   5]
 [  1  59   7   0   1   0   1]
 [  3   4 133   8   1   1   9]
 [  1   0   8  74   1   0  11]
 [  0   3   5   0  34   0   2]
 [  3   0   1   0   0  29   3]
 [  0   0  12   4   0   2  73]]
Accuracy: 0.8044280442804428


**Task 3:** The sklearn method train_test_split that was used in the Independent Classificaiton notebook to split the data into a training and a test set performs a purely random split. This is not always representative for how labeled and unlabeled nodes are distributed over a network in reality. Create an alternative split by selecting test nodes as follows:
<ul>
    <li>randomly select a small number of nodes (e.g. 3, 5, 10, ....) as "seed" test nodes </li>
    <li>add all direct neighbors of nodes in the test set to the test set</li>
    <li>... until the test test has reached a size of about 20% of the total number of nodes </li>
    </ul>
    
  Now redo the experiments with this train/test split. Are the results better or worse than what you obtained before with completely random splits?

In [165]:
import random
total_nodes = len(coragraph.nodes())
test_size = total_nodes * 0.05
test = []
while len(test) < 10:
    r = random.choice(list(coragraph.nodes()))
    if r not in test:
        test.append(r)
to_visit = [x for x in test]
while len(test) < test_size and len(to_visit) > 0:
    next_visit = to_visit.pop(0)
    for edge in nx.edges(coragraph, nbunch=(next_visit)):
        v = edge[1]
        if v not in to_visit and v not in test and len(test) < test_size:
            test.append(v)
            to_visit.append(v)


In [161]:
def split(lst):
    train_set = []
    test_set = []
    for i in range(len(lst)):
        node = str(coraatts_arr[i][0])
        if node in test:
            test_set.append(lst[i])
        else:
            train_set.append(lst[i])
    return np.asarray(train_set), np.asarray(test_set)

In [143]:
# Use node features
Y = coraatts_arr[::,1434]
features = coraatts_arr[::, 1:1434]
feat_train, feat_test = split(features)
Y_train, Y_test = split(Y)

In [95]:
# Learn a decision tree classifier
dtree = DecisionTreeClassifier( max_depth=64)

dtree.fit(feat_train, Y_train)

print("Accuracy on training data: {}".format(dtree.score(feat_train, Y_train)))

# Predict and evaluate model
Y_pred = dtree.predict(feat_test)

print("Confusion matrix: \n{}".format(confusion_matrix(Y_test,Y_pred)))
print("Accuracy: {}".format(accuracy_score(Y_test,Y_pred)))

Accuracy on training data: 0.9796860572483841
Confusion matrix: 
[[  9   0   4   2   2   1   2]
 [ 10 216  54   3  18   8  12]
 [  5   2  87   4   8   1   8]
 [  1   0   4  15   2   1   2]
 [  4   3   6   0  18   3   3]
 [  0   0   2   0   0   1   1]
 [  3   0   4   1   2   3   7]]
Accuracy: 0.6512915129151291


In [96]:
# Learn a logistic regression model
#learn:
lr=LogisticRegression(solver="lbfgs",class_weight="balanced")
lr.fit(feat_train,Y_train)

#test:
Y_pred=lr.predict(feat_test)

#evaluate:
print("Test data confusion matrix: \n {}".format(confusion_matrix(Y_test,Y_pred)))
print("Accuracy: {}".format(accuracy_score(Y_test,Y_pred)))

Test data confusion matrix: 
 [[ 15   0   1   0   1   0   3]
 [ 12 246  26   7  12   7  11]
 [  6   2  93   6   3   2   3]
 [  1   0   3  20   0   0   1]
 [  1   4   5   0  22   4   1]
 [  0   0   0   0   0   2   2]
 [  1   0   3   1   0   5  10]]
Accuracy: 0.7527675276752768


In [110]:
# Use SVD of transition matrix
feat_train, feat_test = split(svdA_features)

In [107]:
# Learn a decision tree classifier
dtree = DecisionTreeClassifier( max_depth=46)

dtree.fit(feat_train, Y_train)

print("Accuracy on training data: {}".format(dtree.score(feat_train, Y_train)))

# Predict and evaluate model
Y_pred = dtree.predict(feat_test)

print("Confusion matrix: \n{}".format(confusion_matrix(Y_test,Y_pred)))
print("Accuracy: {}".format(accuracy_score(Y_test,Y_pred)))

Accuracy on training data: 0.9362880886426593
Confusion matrix: 
[[  6   3   5   2   2   0   2]
 [  9 106  35  72  94   1   4]
 [  4   1  63  21  19   2   5]
 [  1   1   2  20   1   0   0]
 [  2   5   3   6  20   0   1]
 [  4   0   0   0   0   0   0]
 [  2   3   7   2   3   1   2]]
Accuracy: 0.4003690036900369


In [111]:
# Learn a logistic regression model
#learn:
lr=LogisticRegression(solver="lbfgs",class_weight="balanced")
lr.fit(feat_train,Y_train)

#test:
Y_pred=lr.predict(feat_test)

#evaluate:
print("Test data confusion matrix: \n {}".format(confusion_matrix(Y_test,Y_pred)))
print("Accuracy: {}".format(accuracy_score(Y_test,Y_pred)))

Test data confusion matrix: 
 [[ 13   3   0   0   2   1   1]
 [  4 217  91   1   5   0   3]
 [  3  56  40   7   3   0   6]
 [  0   1   3  21   0   0   0]
 [  0  20   2   0  14   0   1]
 [  4   0   0   0   0   0   0]
 [  1   9   3   0   0   2   5]]
Accuracy: 0.5719557195571956


In [112]:
# use laplacian SVD
feat_train, feat_test = split(svdL_features)

In [121]:
# Learn a decision tree classifier
dtree = DecisionTreeClassifier( max_depth=46)

dtree.fit(feat_train, Y_train)

print("Accuracy on training data: {}".format(dtree.score(feat_train, Y_train)))

# Predict and evaluate model
Y_pred = dtree.predict(feat_test)

print("Confusion matrix: \n{}".format(confusion_matrix(Y_test,Y_pred)))
print("Accuracy: {}".format(accuracy_score(Y_test,Y_pred)))

Accuracy on training data: 0.9889196675900277
Confusion matrix: 
[[ 14   2   3   0   1   0   0]
 [  3 258  23   1  31   3   2]
 [  3   4  98   2   1   3   4]
 [  1   0   2  22   0   0   0]
 [  1   9   5   0  21   0   1]
 [  3   0   0   0   0   1   0]
 [  2   1   4   0   2   1  10]]
Accuracy: 0.7822878228782287


In [123]:
# Learn a logistic regression model
#learn:
lr=LogisticRegression(solver="lbfgs",class_weight="balanced")
lr.fit(feat_train,Y_train)

#test:
Y_pred=lr.predict(feat_test)

#evaluate:
print("Test data confusion matrix: \n {}".format(confusion_matrix(Y_test,Y_pred)))
print("Accuracy: {}".format(accuracy_score(Y_test,Y_pred)))

Test data confusion matrix: 
 [[ 14   3   0   0   1   2   0]
 [  2 311   6   0   1   0   1]
 [  0   4  95   1   8   2   5]
 [  0   0   4  18   3   0   0]
 [  0  14   2   0  21   0   0]
 [  4   0   0   0   0   0   0]
 [  2   3   6   0   2   2   5]]
Accuracy: 0.8560885608856088


**Task 4:** Implement the label propagation algorithm (either iterative or random walk version). Evaluate and compare the accuracy on the two different train/test split constructions.

In [124]:
def rw_mat(test):
    result = []
    for i in range(len(A)):
        id = str(coraatts_arr[i][0])
        if id in test:
            count = np.sum(A[i])
            result.append(A[i] / count)
        else:
            result.append(np.zeros(len(A)))
    return np.asarray(result)

In [175]:
def label_propagation(lst):
    result = []
    rw = rw_mat(lst)
    idxs = [str(x) for x in coraatts_arr[::,0]]
    n = 0
    for node in lst:
        print(f'{n} of {len(lst)}')
        n +=1
        idx = idxs.index(node)
        start = np.zeros(len(rw))
        start[idx] = 1
        flag = 1
        border = 1 / 10 ** 8
        reps = 0
        while flag > border and reps < 100:
            next = np.asarray(np.dot(start, rw)).flatten()
            flag = np.sum(np.abs(next - start))
            start = next
            reps += 1
        labels = {'dunno': 0.0}
        for i in range(len(start)):
            if start[i] > 0:
                l = Y[i]
                if i in labels.keys():
                    labels[l] += start[i]
                else:
                    labels[l] = start[i]
        ls = list(labels.keys())
        ls.sort(key=lambda x: labels[x], reverse=True)
        result.append(ls[0]) if len(ls) > 0 else 'dunno'
    return np.asarray(result)

In [166]:
feat_train, feat_test = split(list(coragraph.nodes()))
Y_pred = label_propagation(feat_test)

  return np.asarray(result)


0 of 136
1 of 136
2 of 136
3 of 136
4 of 136
5 of 136
6 of 136
7 of 136
8 of 136
9 of 136
10 of 136
11 of 136
12 of 136
13 of 136
14 of 136
15 of 136
16 of 136
17 of 136
18 of 136
19 of 136
20 of 136
21 of 136
22 of 136
23 of 136
24 of 136
25 of 136
26 of 136
27 of 136
28 of 136
29 of 136
30 of 136
31 of 136
32 of 136
33 of 136
34 of 136
35 of 136
36 of 136
37 of 136
38 of 136
39 of 136
40 of 136
41 of 136
42 of 136
43 of 136
44 of 136
45 of 136
46 of 136
47 of 136
48 of 136
49 of 136
50 of 136
51 of 136
52 of 136
53 of 136
54 of 136
55 of 136
56 of 136
57 of 136
58 of 136
59 of 136
60 of 136
61 of 136
62 of 136
63 of 136
64 of 136
65 of 136
66 of 136
67 of 136
68 of 136
69 of 136
70 of 136
71 of 136
72 of 136
73 of 136
74 of 136
75 of 136
76 of 136
77 of 136
78 of 136
79 of 136
80 of 136
81 of 136
82 of 136
83 of 136
84 of 136
85 of 136
86 of 136
87 of 136
88 of 136
89 of 136
90 of 136
91 of 136
92 of 136
93 of 136
94 of 136
95 of 136
96 of 136
97 of 136
98 of 136
99 of 136
100 of 136

In [168]:
Y_train, Y_test = split(Y)
print("Confusion matrix: \n{}".format(confusion_matrix(Y_test,Y_pred)))
print("Accuracy: {}".format(accuracy_score(Y_test,Y_pred)))

Confusion matrix: 
[[ 0  9  0  0  0  0  0]
 [ 0  2  1  0  0  0  0]
 [ 0  2 99  0  0  0  0]
 [ 0  0  2  5  0  0  0]
 [ 0  0  7  0  0  0  0]
 [ 0  0  0  0  0  1  1]
 [ 0  0  2  0  0  1  4]]
Accuracy: 0.8161764705882353


In [177]:
feat_train, feat_test, Y_train, Y_test = train_test_split(list(coragraph.nodes),Y, train_size=0.8)
Y_pred = label_propagation(feat_test)
print("Confusion matrix: \n{}".format(confusion_matrix(Y_test,Y_pred)))
print("Accuracy: {}".format(accuracy_score(Y_test,Y_pred)))

  return np.asarray(result)


0 of 542
1 of 542
2 of 542
3 of 542
4 of 542
5 of 542
6 of 542
7 of 542
8 of 542
9 of 542
10 of 542
11 of 542
12 of 542
13 of 542
14 of 542
15 of 542
16 of 542
17 of 542
18 of 542
19 of 542
20 of 542
21 of 542
22 of 542
23 of 542
24 of 542
25 of 542
26 of 542
27 of 542
28 of 542
29 of 542
30 of 542
31 of 542
32 of 542
33 of 542
34 of 542
35 of 542
36 of 542
37 of 542
38 of 542
39 of 542
40 of 542
41 of 542
42 of 542
43 of 542
44 of 542
45 of 542
46 of 542
47 of 542
48 of 542
49 of 542
50 of 542
51 of 542
52 of 542
53 of 542
54 of 542
55 of 542
56 of 542
57 of 542
58 of 542
59 of 542
60 of 542
61 of 542
62 of 542
63 of 542
64 of 542
65 of 542
66 of 542
67 of 542
68 of 542
69 of 542
70 of 542
71 of 542
72 of 542
73 of 542
74 of 542
75 of 542
76 of 542
77 of 542
78 of 542
79 of 542
80 of 542
81 of 542
82 of 542
83 of 542
84 of 542
85 of 542
86 of 542
87 of 542
88 of 542
89 of 542
90 of 542
91 of 542
92 of 542
93 of 542
94 of 542
95 of 542
96 of 542
97 of 542
98 of 542
99 of 542
100 of 542