## Active Learning Method

The method removes regular edges from the graph so that it becomes harder to predict which links exist. A link's irregularity is measured by how often it is traversed in random walks, on the assumption that irregular links are more likely to be structurally important.

Draw a random set of 0.2*n starting nodes and run a random walk of length m from each one (m being the number of edges). The links traversed by each walk are tracked, and each time an edge is used its importance increases. The beta * m most regular links, i.e. those with the lowest importance, are then removed.
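
The description above can be sketched with explicit walks on a toy graph. This is only an illustration under stated assumptions, not the implementation used in the cells below: the helper `count_edge_usage`, the adjacency dict `adj`, and the parameter values are all hypothetical.

```python
import random
from collections import Counter


def count_edge_usage(adj, alpha=0.2, walk_length=10, seed=0):
    """Count how often each undirected edge is traversed by random walks.

    adj maps each node to a list of its neighbours (hypothetical input format).
    """
    rng = random.Random(seed)
    nodes = sorted(adj)
    n_walkers = max(1, int(alpha * len(nodes)))  # alpha * n starting nodes
    usage = Counter()
    for start in rng.sample(nodes, n_walkers):
        current = start
        for _ in range(walk_length):
            if not adj[current]:  # isolated node: the walk cannot move
                break
            nxt = rng.choice(adj[current])
            # store the edge with the smaller endpoint first
            usage[(min(current, nxt), max(current, nxt))] += 1
            current = nxt
    return usage


# Toy graph: the 5-cycle 0-1-2-3-4-0 plus the chord 0-2
adj = {0: [1, 4, 2], 1: [0, 2], 2: [1, 3, 0], 3: [2, 4], 4: [3, 0]}
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (0, 4), (0, 2)]

usage = count_edge_usage(adj, alpha=0.4, walk_length=20)
ranked = sorted(edges, key=lambda e: usage[e])  # ascending importance
print(ranked)
```

Every step of a walk increments exactly one edge counter, so the counts sum to the number of walkers times the walk length.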

This algorithm is probabilistic.
The implementation below runs in about O(n * maxIter) time: O(n^3) when maxIter is set to the number of edges, but O(n) when maxIter is a fixed constant.
The stated time complexity is O(n*m), from O(n) random walks of O(m) length each.

In [18]:
import networkx as nx
import scipy
import random
import numpy as np

In [37]:
from collections import Counter

In [71]:
def linkImportanceMeasuring(G:nx.Graph, c = 0.1, alpha = 0.2):
    '''returns ranked edge list Rlist in ascending order of importance'''
    n = G.number_of_nodes()
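    N = alpha * n  ## intended number of walkers (alpha * n); currently unused, the sampling below uses k = n
    maxIter = 100  ## fixed number of steps; alternative: G.number_of_edges()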
    
    Pt = np.full(n, 1/n) ## Pt is the distribution to sample from; start with uniform P0
    ## Update rule: P_{t+1} = (1-c) * S^T P_t + (c/n) * 1, where 1 is the all-ones vector
    ## and S = D A with D = diag(1/d_i). P_t is the probability that a random walk
    ## is at each node at time t; no starting node is given.
    A = nx.adjacency_matrix(G)
    D = scipy.sparse.diags_array([1/d for d in dict(G.degree()).values()])
    S = D@A
    covern_1 = np.full(n, c/n)  

    Rt = random.choices(range(n), weights = Pt, k = n)

    W = []
    for t in range(maxIter):
        Ptplus1 = (1-c)* S.T@Pt + covern_1
        Rtplus1 = random.choices(range(n), weights = Ptplus1, k = n)
        W += [i for i in zip(Rt, Rtplus1)]
        Pt = Ptplus1
        Rt = Rtplus1
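    ## Based on my understanding of the paper, we sample from the distribution at each
    ## step t independently; we do not condition on the previous position.

    ## store each pair with the smaller index first so (u, v) and (v, u) count as one link
    W = [(min(u,v), max(u,v)) for (u,v) in W]
    ## update link importance based on W and get Rlist in ascending order of importance;
    ## assumes node labels are the integers 0..n-1 in adjacency-matrix order
    counts = Counter(W)  ## renamed so the restart probability c is not shadowed
    Rlist = sorted(list(G.edges()), key = lambda e: counts[(min(e), max(e))])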
    return Rlist
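
The update rule used inside `linkImportanceMeasuring` can be checked in isolation. The following is a minimal sketch on a hypothetical 4-node graph with a dense adjacency matrix (the cell above uses a sparse one); the graph and the variable names are assumptions made for illustration only.

```python
import numpy as np

# Toy undirected graph: path 0-1-2-3 plus the edge 1-3 (hypothetical example)
A = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 1],
    [0, 1, 0, 1],
    [0, 1, 1, 0],
], dtype=float)
n = A.shape[0]
c = 0.1  # restart probability, same default as in the cell above

S = A / A.sum(axis=1, keepdims=True)  # row-normalised: S = D A with D = diag(1/d_i)
P = np.full(n, 1 / n)                 # uniform P_0

for t in range(100):
    # P_{t+1} = (1 - c) * S^T P_t + (c / n) * 1
    P = (1 - c) * S.T @ P + c / n

print(P, P.sum())
```

Because every row of S sums to 1, each update keeps P a probability distribution, and in this toy graph the highest-degree node (node 1) ends up with the largest probability.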


In [76]:
def optimizationLinkPerturbation(G:nx.Graph, beta = 0.1):
    '''Takes G and perturbation ratio beta.
    Returns a copy of G that keeps only the (1 - beta) fraction of most important edges.'''
    Rlist = linkImportanceMeasuring(G)
    Gstar = G.copy()
    Gstar.remove_edges_from(Rlist[:int(beta*G.number_of_edges())])
    return Gstar
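
The removal step reduces to slicing a ranked list. A small standalone sketch, using plain edge tuples instead of a networkx graph, shows the arithmetic; the helper `remove_least_important` and the scores in `importance` are hypothetical.

```python
def remove_least_important(edges, importance, beta=0.1):
    """Return the edges left after dropping the int(beta * |E|) lowest-scored ones."""
    ranked = sorted(edges, key=lambda e: importance.get(e, 0))  # ascending importance
    n_remove = int(beta * len(edges))  # truncates, so fewer than 1 means nothing is removed
    return ranked[n_remove:]


edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4)]
importance = {(0, 1): 5, (0, 2): 1, (1, 2): 7, (2, 3): 9, (3, 4): 2}  # made-up scores

kept = remove_least_important(edges, importance, beta=0.4)
print(kept)  # [(0, 1), (1, 2), (2, 3)]
```

With 5 edges and beta = 0.4, two edges are removed: the ones with scores 1 and 2.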

In [69]:
BAG = nx.barabasi_albert_graph(10000,500)

In [70]:
BAGanon = optimizationLinkPerturbation(BAG)

9.162781953811646
14.430991888046265


In [73]:
import datetime

In [74]:
HBA_10k_82 = nx.read_gml("../Data/HBA/HBA_10k_82.gml", destringizer=int)

In [78]:
for i in range(5):
    start = datetime.datetime.now()
    HBA_10k_82_anon = optimizationLinkPerturbation(HBA_10k_82)
    nx.write_adjlist(HBA_10k_82_anon, "../anonymisedGraphs/HBA_10k_82_OLP_29-06_v"+str(i))
    print(datetime.datetime.now()-start)

0:00:02.481199
0:00:02.024303
0:00:02.023517
0:00:02.084436
0:00:02.025416
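
For repeated timings like the loop above, `time.perf_counter` is a common alternative to `datetime.datetime.now()`, since it is a monotonic clock meant for measuring short durations. A minimal sketch follows; the helper `time_calls` and the stand-in workload are hypothetical, and in the notebook the workload would be the call to `optimizationLinkPerturbation`.

```python
import time


def time_calls(fn, repeats=5):
    """Run fn() `repeats` times and return the duration of each run in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


# Stand-in workload used only to make the sketch runnable
durations = time_calls(lambda: sum(range(100_000)))
print(min(durations), sum(durations) / len(durations))
```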
