# High school network and deep learning

The idea is to treat the positive and negative networks as two different directed graphs, use a graph convolutional (GCN) autoencoder (following Tutorial 12 of PyTorch Geometric) to predict links from one of the networks, and then compare the predicted edges with those of the other network. In the first experiment the positive network is the training set and the predictions are compared with the negative network; the roles are then swapped.

* We use a graph autoencoder, an unsupervised neural network that maps the data to another representation (the embedding the network extracts from them) and then tries to rebuild the original data from it. The representation it learns is based on the structure of the network.

* We also use a heuristic method based on PageRank, traditionally used in link prediction, where the probability of a link depends on a quantity called rank.

* We compare both with a null model, a random graph.

The graph autoencoder generates a number of links that is fixed by its built-in decoding step, so we take that number of links as the reference when comparing with the other methods.
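The comparison used throughout can be stated on plain edge sets. The sketch below is only an illustration with made-up edges; the function name `overlap_metrics` is hypothetical and not part of the notebook. It reports how many generated links coincide with the target network, the share of target links recovered, and the share of generated links that are correct.

```python
def overlap_metrics(predicted_edges, target_edges):
    """Compare two directed edge lists given as (source, target) pairs."""
    predicted = set(predicted_edges)
    target = set(target_edges)
    hits = len(predicted & target)
    recovered = hits / len(target) if target else 0.0      # share of target links found
    correct = hits / len(predicted) if predicted else 0.0  # share of generated links that hit
    return hits, recovered, correct


# Toy example: 4 generated links, 2 of which exist in a 5-link target network
predicted = [(0, 1), (1, 2), (2, 3), (3, 0)]
target = [(0, 1), (2, 3), (4, 0), (4, 1), (1, 4)]
hits, recovered, correct = overlap_metrics(predicted, target)
print(hits, recovered, correct)  # 2 0.4 0.5
```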

In [1]:
import networkx as nx
import pandas as pd
## Just prepare the data
nodes = pd.read_csv(r"Nodes_t1.csv",sep=";",encoding = 'unicode_escape')
edges = pd.read_csv(r"Edges_t1.csv",sep=";",encoding = 'unicode_escape')
edges = edges.apply(lambda x: x - x.min(),axis = 0)
###Erase ESO 
nodes["Curso"] = nodes["Curso"].astype(str).str[0].astype("int64")
del nodes["Unnamed: 0"]
edges["weight"] = edges["weight"].apply(lambda x:x+1)
pos_edges = edges[edges["weight"]> 2]
neg_edges = edges[edges["weight"]< 2]
G_positive = nx.from_pandas_edgelist(pos_edges, "from", "to",create_using=nx.DiGraph,edge_attr="weight")
G_negative = nx.from_pandas_edgelist(neg_edges, "from", "to",create_using=nx.DiGraph,edge_attr="weight")
G_negative.add_nodes_from(range(nodes.index.max()+1))

In [2]:

from networkx.algorithms import isomorphism

DiGM = isomorphism.DiGraphMatcher(G_positive,G_negative)

DiGM.is_isomorphic()

False

## Graph autoencoders

### Load the dataset

In [3]:
import numpy as np
import pandas as pd 


import torch
import torch_geometric.data as data
from torch_geometric.nn import GCNConv
import torch_geometric.transforms as T
import torch.nn.functional as F
from torch_geometric.utils import negative_sampling,train_test_split_edges,to_dense_adj
from sklearn.metrics import roc_auc_score
from torch_geometric.transforms import RandomLinkSplit
from sklearn import preprocessing

device = "cpu"

In [4]:
nodes = pd.read_csv(r"Nodes_t1.csv",sep=";",encoding = 'unicode_escape')
edges = pd.read_csv(r"Edges_t1.csv",sep=";",encoding = 'unicode_escape')
edges = edges.apply(lambda x: x - x.min(),axis = 0)
###Erase ESO 
nodes["Curso"] = nodes["Curso"].astype(str).str[0].astype("int64")
del nodes["Unnamed: 0"]
### Separate positive from negative networks
pos_edges = edges[edges["weight"]> 2]
neg_edges = edges[edges["weight"]< 2] 
### One hot encode and normalize node attributes
nodes_dummy = pd.get_dummies(nodes[["Curso","Grupo"]])
rng = np.random.default_rng()
#nodes_dummy = pd.DataFrame(rng.integers(0, 2, size=(409, 10)), columns=list('ABCDEFGHIJ'))

x = nodes_dummy.values #returns a numpy array
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
nodes_norm = pd.DataFrame(x_scaled)

#x = nodes_dummy_2.values #returns a numpy array
#min_max_scaler = preprocessing.MinMaxScaler()
#x_scaled = min_max_scaler.fit_transform(x)
#nodes_norm = pd.DataFrame(x_scaled)
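A side note on the scaling step: min-max scaling maps each column to [0, 1] through (x - min)/(max - min), so a one-hot column that already contains both 0 and 1 is left unchanged, while an integer-coded column is compressed. The pure-Python sketch below illustrates this on toy columns. `min_max_column` is a hypothetical helper, and sending a constant column to zeros is an assumption meant to mirror the behaviour of scikit-learn's `MinMaxScaler`.

```python
def min_max_column(values):
    """Scale a list of numbers to [0, 1]; a constant column maps to zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


one_hot = [0, 1, 0, 0, 1]        # already spans 0..1, so scaling is a no-op
course = [1, 2, 4, 3, 1]         # an integer-coded column is compressed to [0, 1]
print(min_max_column(one_hot))   # [0.0, 1.0, 0.0, 0.0, 1.0]
print(min_max_column(course))
```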

### First, check for isomorphism with NetworkX

NetworkX has an isomorphism module based mainly on the VF2 algorithm: https://www.researchgate.net/publication/200034365_An_Improved_Algorithm_for_Matching_Large_Graphs
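Before running a full VF2 search it can help to check cheap necessary conditions: two directed graphs with different numbers of edges, or with different sorted (in-degree, out-degree) sequences, cannot be isomorphic. The sketch below is a plain-Python illustration on toy edge lists. `degree_signature` and `passes_cheap_checks` are hypothetical helpers, and passing the checks does not prove isomorphism.

```python
from collections import Counter


def degree_signature(edges, num_nodes):
    """Sorted list of (in_degree, out_degree) pairs for nodes 0..num_nodes-1."""
    unique_edges = set(edges)
    out_deg = Counter(u for u, _ in unique_edges)
    in_deg = Counter(v for _, v in unique_edges)
    return sorted((in_deg[n], out_deg[n]) for n in range(num_nodes))


def passes_cheap_checks(edges_a, edges_b, num_nodes):
    """False means certainly not isomorphic; True is inconclusive."""
    if len(set(edges_a)) != len(set(edges_b)):
        return False
    return degree_signature(edges_a, num_nodes) == degree_signature(edges_b, num_nodes)


cycle = [(0, 1), (1, 2), (2, 0)]
relabelled_cycle = [(1, 0), (0, 2), (2, 1)]
path = [(0, 1), (1, 2)]
print(passes_cheap_checks(cycle, relabelled_cycle, 3))  # True
print(passes_cheap_checks(cycle, path, 3))              # False
```

If the two networks have different numbers of links, as the counts printed later in the notebook suggest (7302 versus 1255), the edge-count check alone settles the question.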


In [5]:
from networkx.algorithms import isomorphism


DiGM = isomorphism.DiGraphMatcher(G_positive,G_negative)

print("Is the graph of positive links directly isomorphic to the negative one? {}.".format(DiGM.is_isomorphic()))

Is the graph of positive links directly isomorphic to the negative one? False.


In [6]:
### Node features: the normalized class and group attributes
positive_data = data.Data(x=torch.tensor(nodes_norm.to_numpy(),dtype=torch.float32),
                          edge_index=torch.tensor(pos_edges[["from","to"]].to_numpy().T))
negative_data = data.Data(x=torch.tensor(nodes_norm.to_numpy(),dtype=torch.float32),
                          edge_index=torch.tensor(neg_edges[["from","to"]].to_numpy().T))

In [7]:
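data = positive_data.clone()  # note: this rebinds `data`, shadowing the torch_geometric.data module
data.num_nodes = data.x.size(0)
# train_test_split_edges treats the graph as undirected: it keeps one direction
# per node pair and symmetrizes the training edges. It is deprecated in recent
# PyG releases in favor of RandomLinkSplit.
data = train_test_split_edges(data)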
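Conceptually, the split holds out a random share of the links for validation and testing and trains on the rest. The following plain-Python sketch shows that idea only. It is not the PyG implementation, `split_edges` is a hypothetical helper, and the 5 % / 10 % ratios are assumed here to match the PyG defaults.

```python
import random


def split_edges(edges, val_ratio=0.05, test_ratio=0.1, seed=0):
    """Shuffle the edge list and cut it into train / validation / test parts."""
    rng = random.Random(seed)
    shuffled = list(edges)
    rng.shuffle(shuffled)
    n_val = int(len(shuffled) * val_ratio)
    n_test = int(len(shuffled) * test_ratio)
    val = shuffled[:n_val]
    test = shuffled[n_val:n_val + n_test]
    train = shuffled[n_val + n_test:]
    return train, val, test


# Toy graph: 100 directed links on 20 nodes
edges = [(u, (u + k) % 20) for u in range(20) for k in range(1, 6)]
train, val, test = split_edges(edges)
print(len(train), len(val), len(test))  # 85 5 10
```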




### Models for the neural network

#### Autoencoder

In [8]:
class Net(torch.nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = GCNConv(data.num_features, 128)
        self.conv2 = GCNConv(128, 64)

    def encode(self):
        x = self.conv1(data.x, data.train_pos_edge_index) # convolution 1
        x = x.relu()
        return self.conv2(x, data.train_pos_edge_index) # convolution 2

    def decode(self, z, pos_edge_index, neg_edge_index): # only pos and neg edges
        edge_index = torch.cat([pos_edge_index, neg_edge_index], dim=-1) # concatenate pos and neg edges
        logits = (z[edge_index[0]] * z[edge_index[1]]).sum(dim=-1)  # dot product 
        return logits

    def decode_all(self, z): 
        prob_adj = z @ z.t() # get adj NxN
        return (prob_adj > 1-10e-10).nonzero(as_tuple=False).t() # get predicted edge_list 
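The decoder scores a candidate link by the dot product of the two node embeddings, and `decode_all` keeps every pair whose raw score exceeds a threshold. The toy example below reproduces that logic with plain lists instead of tensors. The embeddings are invented for illustration, `decode_all_pairs` is a hypothetical name, and self-loops are skipped directly here whereas the notebook filters them afterwards. Note that the threshold acts on the raw score rather than on the sigmoid probability, and that the score is symmetric in the two nodes.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def decode_all_pairs(z, threshold):
    """Return every ordered pair (i, j), i != j, whose score z_i . z_j exceeds the threshold."""
    n = len(z)
    return [(i, j) for i in range(n) for j in range(n)
            if i != j and dot(z[i], z[j]) > threshold]


# Made-up 2-dimensional embeddings for three nodes
z = [[1.0, 0.5], [0.9, 0.6], [-1.0, 0.2]]
edges = decode_all_pairs(z, threshold=1.0)
print(edges)  # [(0, 1), (1, 0)]

# A raw score of 1.0 corresponds to a sigmoid probability of about 0.73
print(round(1 / (1 + math.exp(-1.0)), 2))  # 0.73
```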

#### Set the parameters and move data to autoencoder

In [9]:
model, positive_data = Net().to(device), positive_data.to(device)
optimizer = torch.optim.Adam(params=model.parameters(), lr=0.01)

#### Algorithms of training and evaluation (Tutorial PyG)

In [10]:

def get_link_labels(pos_edge_index, neg_edge_index):
    # returns a tensor:
    # [1,1,1,1,...,0,0,0,0,0,..] where the number of ones is equal to the length of pos_edge_index
    # and the number of zeros is equal to the length of neg_edge_index
    E = pos_edge_index.size(1) + neg_edge_index.size(1)
    link_labels = torch.zeros(E, dtype=torch.float, device=device)
    link_labels[:pos_edge_index.size(1)] = 1.
    return link_labels


def train():
    model.train()

    neg_edge_index = negative_sampling(
        edge_index=data.train_pos_edge_index, #positive edges
        num_nodes=data.num_nodes, # number of nodes
        num_neg_samples=data.train_pos_edge_index.size(1)) # number of neg_sample equal to number of pos_edges

    optimizer.zero_grad()
    
    z = model.encode() #encode
    link_logits = model.decode(z, data.train_pos_edge_index, neg_edge_index) # decode
    
    link_labels = get_link_labels(data.train_pos_edge_index, neg_edge_index)
    loss = F.binary_cross_entropy_with_logits(link_logits, link_labels)
    loss.backward()
    optimizer.step()

    return loss


@torch.no_grad()
def test():
    model.eval()
    perfs = []
    for prefix in ["val", "test"]:
        pos_edge_index = data[f'{prefix}_pos_edge_index']
        neg_edge_index = data[f'{prefix}_neg_edge_index']

        z = model.encode() # encode train
        link_logits = model.decode(z, pos_edge_index, neg_edge_index) # decode test or val
        link_probs = link_logits.sigmoid() # apply sigmoid
        
        link_labels = get_link_labels(pos_edge_index, neg_edge_index) # get link
        
        perfs.append(roc_auc_score(link_labels.cpu(), link_probs.cpu())) #compute roc_auc score
    return perfs
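The ROC AUC reported by `test()` can be read as the probability that a randomly chosen true link receives a higher score than a randomly chosen sampled non-link, with ties counted as one half. The sketch below computes it by brute force over all pairs. It is an illustration with invented scores, not a replacement for `roc_auc_score`, and `pairwise_auc` is a hypothetical name.

```python
def pairwise_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
print(round(pairwise_auc(labels, scores), 3))  # 0.889
```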


#### Training and test

In [11]:
best_val_perf = test_perf = 0
for epoch in range(1, 2001):
    train_loss = train()
    val_perf, tmp_test_perf = test()
    if val_perf > best_val_perf:
        best_val_perf = val_perf
        test_perf = tmp_test_perf
    log = 'Epoch: {:03d}, Loss: {:.4f}, Val: {:.4f}, Test: {:.4f}'
    if epoch % 100 == 0:
        print(log.format(epoch, train_loss, best_val_perf, test_perf))

Epoch: 100, Loss: 0.4560, Val: 0.9138, Test: 0.9211
Epoch: 200, Loss: 0.4478, Val: 0.9259, Test: 0.9194
Epoch: 300, Loss: 0.4354, Val: 0.9288, Test: 0.9192
Epoch: 400, Loss: 0.4280, Val: 0.9306, Test: 0.9251
Epoch: 500, Loss: 0.4237, Val: 0.9327, Test: 0.9253
Epoch: 600, Loss: 0.4166, Val: 0.9333, Test: 0.9291
Epoch: 700, Loss: 0.4142, Val: 0.9351, Test: 0.9302
Epoch: 800, Loss: 0.4293, Val: 0.9351, Test: 0.9302
Epoch: 900, Loss: 0.4141, Val: 0.9351, Test: 0.9302
Epoch: 1000, Loss: 0.4238, Val: 0.9358, Test: 0.9296
Epoch: 1100, Loss: 0.4091, Val: 0.9358, Test: 0.9296
Epoch: 1200, Loss: 0.4052, Val: 0.9367, Test: 0.9326
Epoch: 1300, Loss: 0.4129, Val: 0.9370, Test: 0.9321
Epoch: 1400, Loss: 0.4081, Val: 0.9377, Test: 0.9340
Epoch: 1500, Loss: 0.4104, Val: 0.9386, Test: 0.9341
Epoch: 1600, Loss: 0.4087, Val: 0.9386, Test: 0.9341
Epoch: 1700, Loss: 0.4083, Val: 0.9395, Test: 0.9318
Epoch: 1800, Loss: 0.4136, Val: 0.9403, Test: 0.9353
Epoch: 1900, Loss: 0.3966, Val: 0.9404, Test: 0.9280
Ep

In [12]:
z = model.encode()
final_edge_index_1 = model.decode_all(z)
# Remove self-loops
bool_mask = final_edge_index_1[0] != final_edge_index_1[1]
simulated_edges_1 = final_edge_index_1[:, bool_mask]
    

In [13]:
coincidences = to_dense_adj(negative_data["edge_index"]).squeeze()*to_dense_adj(final_edge_index_1).squeeze()
pos_edges = positive_data.edge_index.size()[1]
neg_edges = negative_data.edge_index.size()[1]
metric = coincidences.sum()/negative_data.edge_index.size()[1]
print("The total number of available links are {}".format(409*408))
print("The positive (negative) network has {} ({}) links ".format(pos_edges,neg_edges))
print("The total amount of generated links are {}, and {:4d} of them are in the negative network ".format(final_edge_index_1.size()[1],int(coincidences.sum())))
print("This is a {:.2f} % of the total links in the negative network ".format(metric*100))
print("The probability of predicting correctly a link is {:.2f} % in the case of the neural network".format(100*int(coincidences.sum())/final_edge_index_1.size()[1]))
coin_GNN_neg = coincidences.sum()

The total number of available links are 166872
The positive (negative) network has 7302 (1255) links 
The total amount of generated links are 23655, and  567 of them are in the negative network 
This is a 45.18 % of the total links in the negative network 
The probability of predicting correctly a link is 2.40 % in the case of the neural network
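The figures printed above can be put next to the chance level. The sketch below only redoes the arithmetic with the printed counts; the variable names are introduced here for illustration. A link drawn uniformly among the 409 × 408 ordered pairs lands in the negative network with probability 1255/166872, so the ratio between the two rates indicates how far the generated links are from chance.

```python
n_nodes = 409
possible_links = n_nodes * (n_nodes - 1)   # ordered pairs without self-loops
neg_links = 1255                           # links in the negative network
generated = 23655                          # links produced by the autoencoder
hits = 567                                 # generated links found in the negative network

recovered = hits / neg_links               # share of the negative network reproduced
hit_rate = hits / generated                # share of generated links that are correct
chance_rate = neg_links / possible_links   # hit rate of a uniformly random link

print(f"{100 * recovered:.2f} %")          # 45.18 %
print(f"{100 * hit_rate:.2f} %")           # 2.40 %
print(f"{100 * chance_rate:.2f} %")        # 0.75 %
print(f"{hit_rate / chance_rate:.1f}x")    # 3.2x
```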


### The other way around

In [14]:

data = negative_data.clone()
data.num_nodes = len(data._store["x"])
data = train_test_split_edges(data)

model, positive_data = Net().to(device), positive_data.to(device)
optimizer = torch.optim.Adam(params=model.parameters(), lr=0.01)



In [15]:
best_val_perf = test_perf = 0
for epoch in range(1, 2001):
    train_loss = train()
    val_perf, tmp_test_perf = test()
    if val_perf > best_val_perf:
        best_val_perf = val_perf
        test_perf = tmp_test_perf
    log = 'Epoch: {:03d}, Loss: {:.4f}, Val: {:.4f}, Test: {:.4f}'
    if epoch % 100 == 0:
        print(log.format(epoch, train_loss, best_val_perf, test_perf))

Epoch: 100, Loss: 0.4412, Val: 0.8431, Test: 0.9697
Epoch: 200, Loss: 0.4005, Val: 0.8526, Test: 0.9303
Epoch: 300, Loss: 0.4302, Val: 0.8526, Test: 0.9303
Epoch: 400, Loss: 0.3904, Val: 0.8526, Test: 0.9303
Epoch: 500, Loss: 0.4002, Val: 0.8526, Test: 0.9303
Epoch: 600, Loss: 0.4161, Val: 0.8526, Test: 0.9303
Epoch: 700, Loss: 0.3948, Val: 0.8526, Test: 0.9303
Epoch: 800, Loss: 0.4014, Val: 0.8526, Test: 0.9303
Epoch: 900, Loss: 0.4099, Val: 0.8526, Test: 0.9303
Epoch: 1000, Loss: 0.3942, Val: 0.8526, Test: 0.9303
Epoch: 1100, Loss: 0.3924, Val: 0.8526, Test: 0.9303
Epoch: 1200, Loss: 0.3869, Val: 0.8526, Test: 0.9303
Epoch: 1300, Loss: 0.3765, Val: 0.8526, Test: 0.9303
Epoch: 1400, Loss: 0.4001, Val: 0.8526, Test: 0.9303
Epoch: 1500, Loss: 0.3921, Val: 0.8526, Test: 0.9303
Epoch: 1600, Loss: 0.3790, Val: 0.8526, Test: 0.9303
Epoch: 1700, Loss: 0.3858, Val: 0.8526, Test: 0.9303
Epoch: 1800, Loss: 0.3863, Val: 0.8526, Test: 0.9303
Epoch: 1900, Loss: 0.4025, Val: 0.8526, Test: 0.9303
Ep

In [16]:
z = model.encode()
final_edge_index_2 = model.decode_all(z)
# Remove self-loops
bool_mask = final_edge_index_2[0] != final_edge_index_2[1]
simulated_edges_2 = final_edge_index_2[:, bool_mask]

In [17]:
coincidences = to_dense_adj(positive_data["edge_index"]).squeeze()*to_dense_adj(final_edge_index_2).squeeze()
pos_edges = positive_data.edge_index.size()[1]
neg_edges = negative_data.edge_index.size()[1]
metric = coincidences.sum()/positive_data.edge_index.size()[1]
print("The total number of available links are {}".format(409*408))
print("The positive (negative) network has {} ({}) links ".format(pos_edges,neg_edges))
print("The total amount of generated links are {}, and {:4d} of them are in the positive network ".format(final_edge_index_2.size()[1],int(coincidences.sum())))
print("This is a {:.2f} % of the total links in the positive network ".format(metric*100))
print("The probability of predicting correctly a link {:.2f} % in the neural network".format(100*int(coincidences.sum())/final_edge_index_2.size()[1]))
coin_GNN_pos = coincidences.sum()

The total number of available links are 166872
The positive (negative) network has 7302 (1255) links 
The total amount of generated links are 11855, and 2289 of them are in the positive network 
This is a 31.35 % of the total links in the positive network 
The probability of predicting correctly a link 19.31 % in the neural network


### Link prediction with PageRank 

The paper co-authored by _Alain Barrat_ that Anxo recommended (_New Insights and Methods for Predicting Face-to-Face Contacts_) describes the _Hybrid Rooted PageRank_. We implement it below, starting from its definition:

*  With probability $\alpha$, jump to the root node _r_.
*  With probability $1-\alpha$:
    *  Choose a network $N_{i} \in N$ according to the probability distribution _P_.
    *  If the current node has no outgoing edges:
        *  Jump to the root node _r_.
    *  Else:
        *  From the current node $c$, jump to a neighbor $n$ selected with probability $\frac{w(c,n)}{\sum_{c \to d} w(c,d)}$, i.e., proportional to the weight $w(c,n)$ of the edge $e(c,n)$.

We modify this procedure, however: _Barrat et al._ combine two networks to extract a single social network, whereas we try to deduce one network from the other. We therefore compute the rank on one of the networks and use it to predict the links of the other.
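The steps listed above can be sketched as a single walk step on a weighted adjacency dictionary. This is a simplified, single-network illustration (the choice among several networks $N_i$ is left out), and `rooted_walk_step`, `visit_frequencies` and the toy graph are hypothetical names introduced here, not code from the paper.

```python
import random


def rooted_walk_step(graph, current, root, alpha, rng):
    """One step of a rooted random walk on graph[c] = {neighbor: weight}."""
    if rng.random() < alpha:
        return root                          # teleport back to the root
    neighbors = graph.get(current, {})
    if not neighbors:
        return root                          # dead end: jump to the root
    nodes = list(neighbors)
    weights = [neighbors[n] for n in nodes]
    # Neighbor n is chosen with probability w(c, n) / sum_d w(c, d)
    return rng.choices(nodes, weights=weights, k=1)[0]


def visit_frequencies(graph, root, alpha=0.15, steps=20000, seed=0):
    """Estimate the rooted rank of every node as its visit frequency."""
    rng = random.Random(seed)
    counts = {node: 0 for node in graph}
    current = root
    for _ in range(steps):
        current = rooted_walk_step(graph, current, root, alpha, rng)
        counts[current] += 1
    return {node: c / steps for node, c in counts.items()}


toy = {0: {1: 3.0, 2: 1.0}, 1: {2: 1.0}, 2: {}}
freq = visit_frequencies(toy, root=0)
print(freq)
```

In this toy graph the root is visited most often, because every dead end and every teleport leads back to it.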

In [18]:
import random as rd

def HR_pagerank(alpha, G, N_rounds=10000):
    # Monte Carlo rank: counts how often each node is accepted as a jump target.
    # Assumes the node labels are the integers 0..len(G)-1, since they index `rank`.
    node_list = list(G.nodes())
    weight = nx.get_edge_attributes(G, 'weight')  # computed once, not at every step
    rank = [0] * len(node_list)
    for rounds in range(N_rounds):
        for site in node_list:
            # With probability alpha no jump is attempted
            # (the teleport to a root node is not simulated)
            if rd.uniform(0, 1) <= alpha:
                continue
            neighbors = list(G.neighbors(site))  # successors in a DiGraph
            target = rd.choice([n for n in node_list if n != site])
            if target in neighbors:
                # Accept the jump with probability w(site, target) / mean outgoing weight
                av_weights = sum(weight[(site, n)] for n in neighbors) / len(neighbors)
                if rd.uniform(0, 1) < weight[(site, target)] / av_weights:
                    rank[target] += 1
    rank = [item / N_rounds for item in rank]
    return rank

def create_link(G,rank):
    index_pair = rd.sample(range(len(rank)),2)
    rd_pair = [rank[item] for item in index_pair]
    p_rank = 1/(1 + np.exp(-(rd_pair[0]-rd_pair[1])))
    if rd.uniform(0,1) < p_rank: 
        G.add_edge(index_pair[0],index_pair[1])
    return 
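`create_link` accepts a candidate link from node $i$ to node $j$ with probability $1/(1+e^{-(r_i - r_j)})$, where $r$ is the rank. The short example below evaluates that expression for a few invented rank values to show its behaviour: equal ranks give 0.5, and the two directions of a pair always sum to 1. The helper name `link_probability` is hypothetical.

```python
import math


def link_probability(rank_source, rank_target):
    """Logistic function of the rank difference, as used in create_link."""
    return 1.0 / (1.0 + math.exp(-(rank_source - rank_target)))


print(link_probability(0.2, 0.2))                 # 0.5
p_forward = link_probability(0.30, 0.05)
p_backward = link_probability(0.05, 0.30)
print(round(p_forward, 3), round(p_backward, 3))  # 0.562 0.438
```

Since the ranks produced by `HR_pagerank` are counts divided by the number of rounds, their differences are typically well below 1, which keeps this probability fairly close to 0.5.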

In [19]:
positive_rank = HR_pagerank(0.15,G_positive)
G_simulated = nx.DiGraph()
while len(G_simulated.edges())< final_edge_index_2.size()[1]:
    create_link(G_simulated,positive_rank)

In [20]:
# Rows and columns of nx.adjacency_matrix follow G_simulated.nodes() (insertion order) unless a nodelist is passed
coincidences = to_dense_adj(negative_data["edge_index"]).squeeze()*torch.tensor(nx.adjacency_matrix(G_simulated).todense())
pos_edges = positive_data.edge_index.size()[1]
neg_edges = negative_data.edge_index.size()[1]
# Share of the negative network that is reproduced
metric = coincidences.sum()/negative_data.edge_index.size()[1]
print("The total number of available links are {}".format(409*408))
print("The positive (negative) network has {} ({}) links ".format(pos_edges,neg_edges))
print("The total amount of generated links are {}, and {:4d} of them are in the negative network ".format(final_edge_index_2.size()[1],int(coincidences.sum())))
print("This is a {:.2f} % of the total links in the negative network ".format(metric*100))
print("The probability of predicting correctly a link at random is {:.2f} % versus a {:.2f} % of the heuristics".format(
     neg_edges*100/(409*408),100*int(coincidences.sum())/final_edge_index_2.size()[1]))
coin_rank_pos = coincidences.sum()

The total number of available links are 166872
The positive (negative) network has 7302 (1255) links 
The total amount of generated links are 11855, and  102 of them are in the negative network 
This is a 8.13 % of the total links in the negative network 
The probability of predicting correctly a link at random is 0.75 % versus a 0.86 % of the heuristics


In [21]:
negative_rank = HR_pagerank(0.15,G_negative)
G_simulated = nx.DiGraph()
while len(G_simulated.edges())< final_edge_index_1.size()[1]:
    create_link(G_simulated,negative_rank)

In [22]:
coincidences = to_dense_adj(positive_data["edge_index"]).squeeze()*torch.tensor(nx.adjacency_matrix(G_simulated).todense())
pos_edges = positive_data.edge_index.size()[1]
neg_edges = negative_data.edge_index.size()[1]
metric = coincidences.sum()/positive_data.edge_index.size()[1]
print("The total number of available links are {}".format(409*408))
print("The positive (negative) network has {} ({}) links ".format(pos_edges,neg_edges))
# G_simulated was grown to the size of final_edge_index_1 in the previous cell
print("The total amount of generated links are {}, and {:4d} of them are in the positive network ".format(final_edge_index_1.size()[1],int(coincidences.sum())))
print("This is a {:.2f} % of the total links in the positive network ".format(metric*100))
print("The probability of predicting correctly a link at random is {:.2f} % versus a {:.2f} % of the heuristics".format(
     pos_edges*100/(409*408),100*int(coincidences.sum())/final_edge_index_1.size()[1]))
coin_rank_neg = coincidences.sum()

The total number of available links are 166872
The positive (negative) network has 7302 (1255) links 
The total amount of generated links are 23655, and 1087 of them are in the positive network 
This is a 14.89 % of the total links in the positive network 
The probability of predicting correctly a link at random is 4.38 % versus a 4.60 % of the heuristics


### Randomly created network

We compare the results from the GNN and the PageRank with a randomly created network. 
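For reference, the expected overlap of a random network can also be estimated analytically: if $M$ distinct links are drawn uniformly among the $N(N-1)$ ordered pairs, each one falls in a network with $E$ links with probability $E/(N(N-1))$, so about $M E/(N(N-1))$ coincidences are expected. The sketch below evaluates this for 11855 generated links and, as an example, 7302 target links on 409 nodes. It ignores repeated draws, so it slightly overestimates what a simulation that can draw the same pair twice would give, and `expected_coincidences` is a hypothetical helper.

```python
def expected_coincidences(num_generated, num_target_links, num_nodes):
    """Expected overlap between uniformly random directed links and a target network."""
    possible_links = num_nodes * (num_nodes - 1)
    return num_generated * num_target_links / possible_links


# 11855 generated links, 7302 target links, 409 nodes
print(round(expected_coincidences(11855, 7302, 409), 1))  # 518.8
```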

In [23]:
import random as rd 
coincidences_total = 0
for sim in range(100):
    G_random = nx.DiGraph()
    G_random.add_nodes_from(range(409))
    for trial in range(final_edge_index_2.size()[1]):
        rd_sample = rd.sample(range(409),2)
        G_random.add_edge(rd_sample[0],rd_sample[1]) 

    coincidences_random = len([(u,v) for (u,v) in G_random.edges() if G_positive.has_edge(u,v)])
    coincidences_total += coincidences_random
print("There is an average of {:.2f} coincidences.".format(coincidences_total/100))
coin_random_pos = coincidences_total/100

There is an average of 506.99 coincidences.


In [24]:
import random as rd 
coincidences_total = 0
for sim in range(100):
    G_random = nx.DiGraph()
    G_random.add_nodes_from(range(409))
    for trial in range(final_edge_index_2.size()[1]):
        rd_sample = rd.sample(range(409),2)
        G_random.add_edge(rd_sample[0],rd_sample[1]) 

    coincidences_random = len([(u,v) for (u,v) in G_random.edges() if G_negative.has_edge(u,v)])
    coincidences_total += coincidences_random
print("There is an average of {:.2f} coincidences.".format(coincidences_total/100))
coin_random_neg = coincidences_total/100

There is an average of 25.16 coincidences.


### Results compared in coincidences with the original networks

We write $+/-$ for the power to predict the negative network from the positive one, and $-/+$ for the other way around. Results are expressed as the fraction of links of the original network that each method is able to reproduce.

| Method | +/- | -/+ |
| --- | --- | --- |
| **Random network** | 0.031 | 0.105 |
| **PageRank heuristics** | 0.106 | 0.167 |
| **GCN** | 0.391 | 0.177 |
| **GCN with class/group info** | 0.426 | 0.305 |


The values may differ slightly between runs because the autoencoder generates a different number of links in each realization, but the graph convolutional networks clearly outperform the heuristics used in link prediction. It is difficult to draw conclusions about the structural relationship between the two networks, but some kind of relationship exists, as the scores differ for all the methods.
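One way to read the table is to divide each score by the random-network score in the same column. The snippet below does this for the values reported above; the dictionary layout and the name `lift` are introduced here only for illustration.

```python
scores = {
    "Random network": (0.031, 0.105),
    "PageRank heuristics": (0.106, 0.167),
    "GCN": (0.391, 0.177),
    "GCN with class/group info": (0.426, 0.305),
}

baseline = scores["Random network"]
lift = {
    method: tuple(round(value / base, 1) for value, base in zip(values, baseline))
    for method, values in scores.items()
}

for method, (plus_minus, minus_plus) in lift.items():
    print(f"{method}: {plus_minus}x (+/-), {minus_plus}x (-/+)")
# e.g. the GCN row prints: GCN: 12.6x (+/-), 1.7x (-/+)
```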

**Work to be done** 

1) Check structural balance theory by computing global equilibria in both networks, in order to generate ensembles.

2) Graph neural networks can also be used to predict edge labels, which could be used to test structural balance theory from another perspective.

