# High school network and deep learning

Our point is, considering the positive and negative networks as two different directed graphs, use a GConv network with previous autoencoder (Tutorial 12 Pytorch Geometric) to predict links from one of the networks, then compare the predicted_edges with the negative ones. Consider the positive as the training set and compare it to the negatives in the test one. 

* We are going to use a graph autoencoder, which is a non-supervised neural network that takes data, translate them to another representation (the one the neural network extracts from them) and then try to rebuild the original data. The representation it learns is based on the structure of the network.

* We will use also a heuristic method, called PageRank method, traditionally used in link prediction, where the probability of a link depends on a variable called rank. 

* We compare it with a null method, a random graph. 

In both of the comparative methods of the GNN, negative links can only be placed where there are not positive links. 

The graph autoencoder generate a fixed number of links depending on built-in functions, so we are taking these number of links in order to establish comparison with other methods.

<div class="alert alert-success" role="alert">
  In this variation we are going to remove a whole class and let the neural network rebuild it. We are considering classes as main components of the whole. This is done just in the positive network. 
</div>

In [1]:
import numpy as np
import pandas as pd 


import torch
import torch_geometric.data as data
from torch_geometric.nn import GCNConv
import torch_geometric.transforms as T
import torch.nn.functional as F
from torch_geometric.utils import negative_sampling,train_test_split_edges,to_dense_adj
from sklearn.metrics import roc_auc_score
from torch_geometric.transforms import RandomLinkSplit
from sklearn import preprocessing

device = "cpu"

In [2]:
import networkx as nx
## Just prepare the data
nodes = pd.read_csv(r"Nodes_t1.csv",sep=";",encoding = 'unicode_escape')
edges = pd.read_csv(r"Edges_t1.csv",sep=";",encoding = 'unicode_escape')
edges = edges.apply(lambda x: x - x.min(),axis = 0)
###Erase ESO 
nodes["Curso"] = nodes["Curso"].astype(str).str[0].astype("int64")
del nodes["Unnamed: 0"]
edges["weight"] = edges["weight"].apply(lambda x:x+1)
pos_edges = edges[edges["weight"]> 3]
neg_edges = edges[edges["weight"]< 3]
G_positive = nx.from_pandas_edgelist(pos_edges, "from", "to",create_using=nx.DiGraph,edge_attr="weight")
G_negative = nx.from_pandas_edgelist(neg_edges, "from", "to",create_using=nx.DiGraph,edge_attr="weight")
G_negative.add_nodes_from(range(nodes.index.max()+1))

<div class="alert alert-success" role="alert">
    We need to remove from the data the edges at a certain distance of a random selected student. We then eliminate all the edges that come from that student, at then from the contacts of that student, up to a a cutoff.
</div

In [3]:
import random as rd
selected_student = [rd.choice(edges["from"].unique())]
d_cutoff = 3
d = 0
edges_saved = []
while d < d_cutoff:
    for student in selected_student:
        selected_edges = edges[["from","to"]][edges["from"] == student]
        nei = []
        for row in selected_edges.iterrows():
            nei.append(row[1]["to"])
            edges_saved.append([row[1]["from"],row[1]["to"]])
    selected_student = nei
    d += 1
edges_saved = torch.Tensor(list(zip(*edges_saved))).type(torch.LongTensor)
#list(set(list(zip(*edges_saved))) & set(edges[["from","to"]].to_numpy()))
saved_df = pd.DataFrame({"from":edges_saved[0],"to":edges_saved[1]})
edges = pd.concat([edges[["from","to"]],saved_df]).drop_duplicates(keep=False)
nodes_class = list(saved_df["from"].unique())

## Graph autoencoders

### Load the dataset

In [4]:
nodes = pd.read_csv(r"Nodes_t1.csv",sep=";",encoding = 'unicode_escape')
edges = pd.read_csv(r"Edges_t1.csv",sep=";",encoding = 'unicode_escape')
edges = edges.apply(lambda x: x - x.min(),axis = 0)
###Erase ESO 
nodes["Curso"] = nodes["Curso"].astype(str).str[0].astype("int64")
del nodes["Unnamed: 0"]
### Separate positive from negative networks
pos_edges = edges[edges["weight"]> 2]
neg_edges = edges[edges["weight"]< 2] 
### One hot encode and normalize node attributes
nodes_dummy = pd.get_dummies(nodes[["Curso","Grupo"]])
rng = np.random.default_rng()
#nodes_dummy = pd.DataFrame(rng.integers(0, 2, size=(409, 10)), columns=list('ABCDEFGHIJ'))

x = nodes_dummy.values #returns a numpy array
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
nodes_norm = pd.DataFrame(x_scaled)

In [5]:
###Without including class and group information 
positive_data = data.Data(x=torch.tensor(nodes_norm.to_numpy(),dtype=torch.float32),
                          edge_index=torch.tensor(pos_edges[["from","to"]].to_numpy().T))
negative_data = data.Data(x=torch.tensor(nodes_norm.to_numpy(),dtype=torch.float32),
                          edge_index=torch.tensor(neg_edges[["from","to"]].to_numpy().T))

In [6]:
data = positive_data.clone()
data.num_nodes = len(data._store["x"])
data = train_test_split_edges(data)




### Models for the neural network

#### Autoencoder

In [7]:
class Net(torch.nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = GCNConv(data.num_features, 128)
        self.conv2 = GCNConv(128, 64)

    def encode(self):
        x = self.conv1(data.x, data.train_pos_edge_index) # convolution 1
        x = x.relu()
        return self.conv2(x, data.train_pos_edge_index) # convolution 2

    def decode(self, z, pos_edge_index, neg_edge_index): # only pos and neg edges
        edge_index = torch.cat([pos_edge_index, neg_edge_index], dim=-1) # concatenate pos and neg edges
        logits = (z[edge_index[0]] * z[edge_index[1]]).sum(dim=-1)  # dot product 
        return logits

    def decode_all(self, z): 
        prob_adj = z @ z.t() # get adj NxN
        return (prob_adj > 1-10e-10).nonzero(as_tuple=False).t() # get predicted edge_list 

#### Set the parameters and move data to autoencoder

In [8]:
model, positive_data = Net().to(device), positive_data.to(device)
optimizer = torch.optim.Adam(params=model.parameters(), lr=0.01)

#### Algorithms of training and evaluation (Tutorial PyG)

In [9]:

def get_link_labels(pos_edge_index, neg_edge_index):
    # returns a tensor:
    # [1,1,1,1,...,0,0,0,0,0,..] with the number of ones is equel to the lenght of pos_edge_index
    # and the number of zeros is equal to the length of neg_edge_index
    E = pos_edge_index.size(1) + neg_edge_index.size(1)
    link_labels = torch.zeros(E, dtype=torch.float, device=device)
    link_labels[:pos_edge_index.size(1)] = 1.
    return link_labels


def train():
    model.train()

    neg_edge_index = negative_sampling(
        edge_index=data.train_pos_edge_index, #positive edges
        num_nodes=data.num_nodes, # number of nodes
        num_neg_samples=data.train_pos_edge_index.size(1)) # number of neg_sample equal to number of pos_edges

    optimizer.zero_grad()
    
    z = model.encode() #encode
    link_logits = model.decode(z, data.train_pos_edge_index, neg_edge_index) # decode
    
    link_labels = get_link_labels(data.train_pos_edge_index, neg_edge_index)
    loss = F.binary_cross_entropy_with_logits(link_logits, link_labels)
    loss.backward()
    optimizer.step()

    return loss


@torch.no_grad()
def test():
    model.eval()
    perfs = []
    for prefix in ["val", "test"]:
        pos_edge_index = data[f'{prefix}_pos_edge_index']
        neg_edge_index = data[f'{prefix}_neg_edge_index']

        z = model.encode() # encode train
        link_logits = model.decode(z, pos_edge_index, neg_edge_index) # decode test or val
        link_probs = link_logits.sigmoid() # apply sigmoid
        
        link_labels = get_link_labels(pos_edge_index, neg_edge_index) # get link
        
        perfs.append(roc_auc_score(link_labels.cpu(), link_probs.cpu())) #compute roc_auc score
    return perfs


#### Training and test

In [10]:
best_val_perf = test_perf = 0
for epoch in range(1, 2001):
    train_loss = train()
    val_perf, tmp_test_perf = test()
    if val_perf > best_val_perf:
        best_val_perf = val_perf
        test_perf = tmp_test_perf
    log = 'Epoch: {:03d}, Loss: {:.4f}, Val: {:.4f}, Test: {:.4f}'
    if epoch % 100 == 0:
        print(log.format(epoch, train_loss, best_val_perf, test_perf))

Epoch: 100, Loss: 0.4471, Val: 0.9050, Test: 0.9161
Epoch: 200, Loss: 0.4390, Val: 0.9134, Test: 0.9239
Epoch: 300, Loss: 0.4256, Val: 0.9134, Test: 0.9239
Epoch: 400, Loss: 0.4325, Val: 0.9157, Test: 0.9291
Epoch: 500, Loss: 0.4188, Val: 0.9178, Test: 0.9299
Epoch: 600, Loss: 0.4285, Val: 0.9178, Test: 0.9299
Epoch: 700, Loss: 0.4113, Val: 0.9178, Test: 0.9299
Epoch: 800, Loss: 0.4110, Val: 0.9206, Test: 0.9297
Epoch: 900, Loss: 0.4120, Val: 0.9239, Test: 0.9323
Epoch: 1000, Loss: 0.4217, Val: 0.9240, Test: 0.9317
Epoch: 1100, Loss: 0.4100, Val: 0.9250, Test: 0.9307
Epoch: 1200, Loss: 0.4030, Val: 0.9252, Test: 0.9324
Epoch: 1300, Loss: 0.4108, Val: 0.9259, Test: 0.9353
Epoch: 1400, Loss: 0.4020, Val: 0.9266, Test: 0.9335
Epoch: 1500, Loss: 0.4150, Val: 0.9271, Test: 0.9330
Epoch: 1600, Loss: 0.3999, Val: 0.9282, Test: 0.9361
Epoch: 1700, Loss: 0.4052, Val: 0.9293, Test: 0.9382
Epoch: 1800, Loss: 0.4003, Val: 0.9293, Test: 0.9382
Epoch: 1900, Loss: 0.4067, Val: 0.9305, Test: 0.9399
Ep

<div class = "alert alert-success">
    We will produce the links, and from them extract the ones referred to the particular course we are trying to rebuild. 
</div>

In [11]:
z = model.encode()
final_edge_index_1 = model.decode_all(z)
#Remove self loops
bool_mask = final_edge_index_1[0] != final_edge_index_1[1]
simulated_edges_1 = torch.empty((2,int(sum(bool_mask))))
for item in range(final_edge_index_1.size()[0]):
    simulated_edges_1[item] = final_edge_index_1[item][bool_mask]
    


In [12]:
edges_saved_compare = list(zip(*edges_saved))
edges_saved_compare = set(list(map(lambda item: (int(item[0]),int(item[1])),edges_saved_compare)))

simulated_edges_compare = list(zip(*simulated_edges_1))
simulated_edges_compare = set(list(map(lambda item: (int(item[0]),int(item[1])),simulated_edges_compare)))

In [13]:
precision = round(len(edges_saved_compare.intersection(simulated_edges_compare))/len(edges_saved[0]),2)
print("The amount of links that were in the data is the {} of the total".format(precision))

The amount of links that were in the data is the 0.73 of the total


<div class = "alert alert-success">
   Compute the coincidences. To do this, we check if each of the predicted links was in the original data. 
</div>

In [14]:
n_links = 0
for item in simulated_edges_compare: 
    if item[0] in nodes_class: 
        n_links += 1
n_links,len(nodes_class)

(1259, 23)

In [15]:
print(f"We generated {len(simulated_edges_compare)} links, {n_links} of them are in the cut region ")
print(f"We cut a total of {len(edges_saved_compare)} links")
print(f"From {n_links}, {int(precision*len(edges_saved_compare))} of {len(edges_saved_compare)} were found in the simulated links. ")

We generated 22550 links, 1259 of them are in the cut region 
We cut a total of 497 links
From 1259, 362 of 497 were found in the simulated links. 
