# Assignment B

We consider the following information spreading process, a simplified Susceptible-Infected (SI) model on a
temporal network. Initially, at time t = 0, a single node s is infected, meaning that it possesses the
information, whereas all other nodes are susceptible, meaning that they have not yet received the
information. Node s is called the seed of the information. Whenever an infected node i is in contact with
a susceptible node j at time step t, node j becomes infected during that same time step and can infect
other nodes only from the next time step onward, via its contacts with susceptible nodes. Once a
node becomes infected, it stays infected forever. For example, assume that the seed node s has its first
contact at time t = 5, with a node m. Although node s has been infected since t = 0, it infects a second
node, m, only at t = 5, when the two are in contact. Infection happens only when an infected node and a
susceptible node are in contact, so the number of infected nodes is non-decreasing over time.

Simulate the information spreading process on the given temporal network G for N iterations. Each
iteration starts with a different seed node infected at t = 0 and ends at t = T = 57791, the last time step at
which the network is measured. Record the number of infected nodes I(t) over time t for each iteration.
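
The update rule described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the notebook's own implementation: the helper name `simulate_si` and the toy contact list are hypothetical, contacts are treated as undirected, and all contacts sharing a timestamp are evaluated against the infection state from before that time step, so a node infected at time t cannot pass the information on until t + 1.

```python
from collections import defaultdict

import numpy as np


def simulate_si(contacts, seed, n_nodes, t_max):
    """Return I(t) for t = 0..t_max for a single seed (hypothetical helper).

    contacts: iterable of (node1, node2, timestamp), nodes labelled 1..n_nodes.
    Assumption: a contact can transmit in either direction.
    """
    # Group contacts by time step so each step can be updated synchronously.
    by_time = defaultdict(list)
    for u, v, t in contacts:
        by_time[t].append((u, v))

    infected = np.zeros(n_nodes + 1, dtype=bool)  # index 0 is unused
    infected[seed] = True
    counts = np.empty(t_max + 1, dtype=int)

    for t in range(t_max + 1):
        # Collect new infections first; apply them only after the whole step
        # has been scanned, so newly infected nodes wait until the next step.
        newly_infected = set()
        for u, v in by_time.get(t, ()):
            if infected[u] and not infected[v]:
                newly_infected.add(v)
            elif infected[v] and not infected[u]:
                newly_infected.add(u)
        for node in newly_infected:
            infected[node] = True
        counts[t] = infected.sum()
    return counts


# Toy example (hypothetical data): contacts 1-2 and 2-3 at t = 1, 3-4 at t = 2.
toy_contacts = [(1, 2, 1), (2, 3, 1), (3, 4, 2)]
toy_counts = simulate_si(toy_contacts, seed=1, n_nodes=4, t_max=3)
```

With seed 1, node 2 is infected at t = 1 but cannot infect node 3 through the contact at that same time step, so the count stays at 2 afterwards.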

In [1]:
import networkx as nx
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pickle
import os.path
import math

In [20]:
df = pd.read_excel('manufacturing_emails_temporal_network.xlsx')
#G = nx.from_pandas_edgelist(df, source='node1', target='node2',edge_attr=True, create_using=nx.MultiGraph())
number_of_nodes = 167
df

Unnamed: 0,node1,node2,timestamp
0,1,2,1
1,1,3,1
2,1,4,1
3,1,5,1
4,1,6,1
...,...,...,...
82871,3,39,57787
82872,3,39,57788
82873,18,19,57789
82874,3,85,57790


## Q9

Taking all N iterations into account, plot the average number of infected nodes $E[I(t)]$ together with
its error bar (standard deviation $\sqrt{Var[I(t)]}$) as a function of the time step t.

In [3]:
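cache_file = "infected_per_timestamp.pickle"
try:
    # Reuse cached results if they exist.
    with open(cache_file, "rb") as f:
        infected_per_timestep = pickle.load(f)
except Exception as e:
    print(e)
    print("Rebuilding data")

    # One row per seed node, one column per contact (row of df).
    infected_per_timestep = np.zeros([number_of_nodes, len(df.index)])

    for n in range(1, number_of_nodes + 1):
        # Nodes are labelled 1..number_of_nodes, so index 0 is unused.
        infected = np.zeros(number_of_nodes + 1, dtype=bool)
        infected[n] = True
        infected_count = 1

        # Note: `timestep` is the row index of df (one contact per row), not the
        # timestamp column, and only node1 -> node2 transmission is checked.
        for timestep, transaction in df.iterrows():
            node1 = transaction['node1']
            node2 = transaction['node2']
            if infected[node1] and not infected[node2]:
                infected[node2] = True
                infected_count += 1

            infected_per_timestep[n - 1, timestep] = infected_count

        print("Infection of starting node " + str(n) + " finished")

    # Cache the result so the next run can skip the simulation.
    with open(cache_file, "wb") as f:
        pickle.dump(infected_per_timestep, f)
    print("Finished creating data")
print("infected_per_timestep initialized")
infected_per_timestep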

infected_per_timestep initialized


array([[  2.,   3.,   4., ..., 167., 167., 167.],
       [  1.,   1.,   1., ..., 162., 162., 162.],
       [  1.,   1.,   1., ..., 165., 165., 165.],
       ...,
       [  1.,   1.,   1., ...,   1.,   1.,   1.],
       [  1.,   1.,   1., ...,   1.,   1.,   1.],
       [  1.,   1.,   1., ...,   1.,   1.,   1.]])

In [4]:
# os.path.isfile returns False rather than raising, so test its result directly.
if os.path.isfile('Infected_per_timestep.xlsx'):
    print("Excelfile already there")
else:
    print("Writing data to excel")
    df_infected_per_timestep = pd.DataFrame(infected_per_timestep.T)
    # Passing a path lets to_excel open and close the file itself.
    df_infected_per_timestep.to_excel('Infected_per_timestep.xlsx', sheet_name='Sheet1', index=False)
    print("Written infected per timestep to excel")



Excelfile already there


In [5]:
# Mean and (population) standard deviation over all seeds, per column.
avg = infected_per_timestep.mean(axis=0)
std = infected_per_timestep.std(axis=0)
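
One way to draw the requested curve is a mean line with a shaded band of one standard deviation, which stays readable even with tens of thousands of points. The sketch below is illustrative only: the helper `plot_mean_with_std` and the toy array are hypothetical, and it assumes an array shaped like `infected_per_timestep` (one row per seed, one column per step). The x-axis is the column index of that array.

```python
import numpy as np
import matplotlib.pyplot as plt


def plot_mean_with_std(infected, ax=None):
    """Plot the column-wise mean with a +/- 1 std band (hypothetical helper).

    infected: array of shape (n_seeds, n_steps).
    """
    infected = np.asarray(infected, dtype=float)
    mean = infected.mean(axis=0)
    std = infected.std(axis=0)  # population standard deviation (ddof=0)
    steps = np.arange(infected.shape[1])

    if ax is None:
        _, ax = plt.subplots()
    ax.plot(steps, mean, label="E[I(t)]")
    # A shaded band is used instead of individual error bars for readability.
    ax.fill_between(steps, mean - std, mean + std, alpha=0.3, label="+/- 1 std")
    ax.set_xlabel("time step t")
    ax.set_ylabel("number of infected nodes")
    ax.legend()
    return ax, mean, std


# Toy data (hypothetical): two seeds, three steps.
toy_infected = np.array([[1, 2, 3], [1, 1, 3]])
ax, toy_mean, toy_std = plot_mean_with_std(toy_infected)
```

In the notebook, calling `plot_mean_with_std(infected_per_timestep)` followed by `plt.show()` would display the figure.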

## Q10

How influential a node is as a seed node can be partially reflected by, e.g., the time it takes to
reach (infect) 80% of all nodes when this node is selected as the seed node. The shorter the time, the more
influential the seed node. Use this criterion to rank the influence of all the nodes and record the
ranking in a vector $R = [R_{(1)}, R_{(2)}, ..., R_{(N)}]$, where $R_{(i)}$ is the node index of the i-th most
influential seed node and $R_{(1)}$ is the most influential node, i.e. the one that infects 80% of the nodes in
the shortest time. Note that you don't need to provide this vector in your report.

In [122]:
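# Smallest number of infected nodes that amounts to at least 80% of the network.
threshold = math.ceil(0.8 * number_of_nodes)

# Column index at which each seed first reaches the threshold (>=, not >);
# seeds that never reach it get the number of columns, one past the last index.
threshold_reached = [np.min(np.where(row >= threshold)[0])
         if np.where(row >= threshold)[0].shape[0] > 0
         else len(infected_per_timestep.T)
         for row in infected_per_timestep]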

In [123]:
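# Pair each node id (1..number_of_nodes) with the index at which it reached the threshold.
node_timestamp_pair = np.stack(
    (np.arange(1, number_of_nodes + 1),
    np.array(threshold_reached)), axis=-1  )
# Sort by that index; a stable sort keeps tied nodes in node-id order.
sorted_nodes = node_timestamp_pair[node_timestamp_pair[:,1].argsort(kind='stable')][:,0]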

## Q11

We are going to explore which node-level network feature best indicates the nodal influence discussed
in Q10. Compute the degree and clustering coefficient of each node in the aggregated network G and rank the
importance of the nodes according to each of these two centrality metrics. You obtain the ordered vectors
$D = [D_{(1)}, D_{(2)}, ..., D_{(N)}]$ and $C = [C_{(1)}, C_{(2)}, ..., C_{(N)}]$, where $D_{(i)}$ is the node with the i-th
highest degree and $C_{(i)}$ is the node with the i-th highest clustering coefficient. How precisely a centrality
metric, e.g. the degree, predicts the seed nodes' influence can be quantified by the top-f recognition rate

$$r_{RD}(f) = \frac{|R_f \cap D_f|}{|R_f|}$$

where $R_f$ and $D_f$ are the sets of nodes ranking in the top f fraction according to their influence and degree
respectively, and $|R_f| = fN$ is the number of nodes in $R_f$. Plot $r_{RD}(f)$ and $r_{RC}(f)$ as a function of f, where
f = 0.05, 0.1, 0.15, ..., 0.5. Which metric, the degree or the clustering coefficient, better predicts the influence
of the nodes? Why?
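
The recognition rate can be computed along the following lines. This is a sketch under assumptions rather than a reference solution: the helpers `rank_by` and `recognition_rate`, the toy graph and the toy influence ranking are all hypothetical, ties are broken by node id, and $fN$ is rounded to the nearest integer (at least 1) because it is generally not a whole number.

```python
import networkx as nx


def rank_by(score):
    """Order nodes by descending score; ties broken by node id (assumption)."""
    return [node for node, _ in sorted(score.items(), key=lambda kv: (-kv[1], kv[0]))]


def recognition_rate(ranking_a, ranking_b, f):
    """Top-f recognition rate |A_f & B_f| / |A_f| for two rankings of the same nodes."""
    # Assumption: f * N is rounded to the nearest integer, with a minimum of 1.
    k = max(1, round(f * len(ranking_a)))
    return len(set(ranking_a[:k]) & set(ranking_b[:k])) / k


# Toy aggregated network (hypothetical data, not the email network).
G_toy = nx.Graph()
G_toy.add_edges_from([(1, 2), (1, 3), (1, 4), (2, 3)])

D_toy = rank_by(dict(G_toy.degree()))   # ranking by degree
C_toy = rank_by(nx.clustering(G_toy))   # ranking by clustering coefficient
R_toy = [1, 2, 3, 4]                    # hypothetical influence ranking

r_RD = recognition_rate(R_toy, D_toy, 0.5)
r_RC = recognition_rate(R_toy, C_toy, 0.5)
```

For the real data, the aggregated network could be built from the `node1` and `node2` columns of `df` with `nx.Graph()`, which collapses repeated contacts into a single edge, and the two rates evaluated for each f from 0.05 to 0.5.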