## 1. Basics of Markov chain models

Consider a grapph $G = (V, E)$ with the following adjacency matrix:

$$
A = \begin{bmatrix}
    0 & 1 & 0 & 1 & 1 \\
    1 & 0 & 1 & 2 & 0 \\
    0 & 1 & 0 & 0 & 0 \\
    1 & 2 & 0 & 0 & 0 \\
    1 & 0 & 0 & 0 & 0
\end{bmatrix}$$

#### (a) Assuming that the row and column numbers correspond to nodes(enumerate from 1 to 5), calculate the probability that a random walker starting in node 1 will traverse the following sequence of nodes: $\left( 1, 2, 3, 2, 4, 1, 5, 1, 2 \right)$

In [29]:
import matplotlib.pyplot as plt
import multiprocessing as mp
import pandas as pd
import pathpy as pp
from scipy.stats import rankdata
from functools import partial
import numpy as np
from tqdm import tqdm
import seaborn as sns
import plotly.io as pio
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from plotly.graph_objects import Figure
import scipy as sp

plt.style.use('default')
sns.set_style("whitegrid")

In [30]:
ntw_adj_matrix = np.array([[0, 1, 0, 1, 1], [1, 0, 1, 2, 0], [0, 1, 0, 0, 0], [1, 2, 0, 0, 0], [1, 0, 0, 0, 0]])
ntw_adj_matrix

array([[0, 1, 0, 1, 1],
       [1, 0, 1, 2, 0],
       [0, 1, 0, 0, 0],
       [1, 2, 0, 0, 0],
       [1, 0, 0, 0, 0]])

In [12]:
transition_matrix = ntw_adj_matrix / ntw_adj_matrix.sum(axis=1, keepdims=True)
transition_matrix

array([[0.        , 0.33333333, 0.        , 0.33333333, 0.33333333],
       [0.25      , 0.        , 0.25      , 0.5       , 0.        ],
       [0.        , 1.        , 0.        , 0.        , 0.        ],
       [0.33333333, 0.66666667, 0.        , 0.        , 0.        ],
       [1.        , 0.        , 0.        , 0.        , 0.        ]])

In [13]:
transition_way = (1, 2, 3, 2, 4, 1, 5, 1, 2)
transition_way

(1, 2, 3, 2, 4, 1, 5, 1, 2)

In [20]:
r = 1
for index, every_position in enumerate(transition_way):
    if index == 0:
        pass
    else:
        r *= transition_matrix[transition_way[index - 1] - 1, every_position - 1]
r

0.0015432098765432098

#### (b) Does a unique stationary distribution exist for a random walk in graph G? If yes, calculate the stationary distribution, if not explain why.

In [21]:
np.linalg.matrix_power(transition_matrix, 2)

array([[0.52777778, 0.22222222, 0.08333333, 0.16666667, 0.        ],
       [0.16666667, 0.66666667, 0.        , 0.08333333, 0.08333333],
       [0.25      , 0.        , 0.25      , 0.5       , 0.        ],
       [0.16666667, 0.11111111, 0.16666667, 0.44444444, 0.11111111],
       [0.        , 0.33333333, 0.        , 0.33333333, 0.33333333]])

In [22]:
np.linalg.matrix_power(transition_matrix, 3)

array([[0.11111111, 0.37037037, 0.05555556, 0.28703704, 0.17592593],
       [0.27777778, 0.11111111, 0.16666667, 0.38888889, 0.05555556],
       [0.16666667, 0.66666667, 0.        , 0.08333333, 0.08333333],
       [0.28703704, 0.51851852, 0.02777778, 0.11111111, 0.05555556],
       [0.52777778, 0.22222222, 0.08333333, 0.16666667, 0.        ]])

In [23]:
np.linalg.matrix_power(transition_matrix, 4)

array([[0.36419753, 0.28395062, 0.09259259, 0.22222222, 0.03703704],
       [0.21296296, 0.51851852, 0.02777778, 0.14814815, 0.09259259],
       [0.27777778, 0.11111111, 0.16666667, 0.38888889, 0.05555556],
       [0.22222222, 0.19753086, 0.12962963, 0.35493827, 0.09567901],
       [0.11111111, 0.37037037, 0.05555556, 0.28703704, 0.17592593]])

The transition matrix is irreducable. So at least this transition matrix is aperiodic. This is the first step for prooving that the matrix is aperiodic. We found that at m = 4 the matrix power has only elements > 0 so the 4 is the aperiodic class. If we would like to find a period - we will need to do it before 4th power. But we have not found it.

#### (c) What is the relative frequency at which we expect the sequence of nodes $(1, 2)$ to appear in an infnite random walk on graph G?

The relative frequency is $\frac{w_{12}}{\sum w_{ij}}$

In [28]:
transition_matrix[0, 1] / np.sum(transition_matrix)

0.06666666666666667

### Random Walks and Node Centralities

Consider a small directed network in the netzschleuder (e. g. the highschool data). Check whether a unique stationary distribution exists. If necessary ensure that the network is aperiodic.

#### (a) Compute the stationary visitation probabilities of nodes and rank the nodes based on those probabilities. Compare the resulting ranking with a ranking based on the in- or out-degree of nodes or other path-based measures. Explain what you observe.

In [2]:
pp.io.graphtool.read_netzschleuder_record("highschool")

{'title': 'Illinois high school students (1958)',
 'description': 'A network of friendships among male students in a small high school in Illinois from 1958. An arc points from student i to student j if i named j as a friend, in either of two identical surveys (from Fall and Spring semesters). Edge weights are the number of surveys in which the friendship was named.[^icon]\n[^icon]: Description obtained from the [ICON](https://icon.colorado.edu) project.',
 'citation': [['J. S. Coleman. "Introduction to Mathematical Sociology." London Free Press Glencoe (1964)',
   'http://www.abebooks.com/Introduction-Mathematical-Sociology-COLEMAN-James-S/189127582/bd']],
 'bibtex': ['@article{coleman1964introduction,\n  title={Introduction to mathematical sociology.},\n  author={Coleman, James Samuel and others},\n  journal={Introduction to mathematical sociology.},\n  year={1964}\n}\n'],
 'url': 'http://konect.cc/networks/moreno_highschool',
 'restricted': False,
 'tags': ['Social', 'Offline', 'Wei

In [3]:
highschool_direct = pp.io.graphtool.read_netzschleuder_network("highschool")
highschool_direct.directed

True

In [24]:
a = [3,4,1]
sorted(a)

[1, 3, 4]

In [25]:
def rank_nodes(centralities_dict):
    rank_node_id_pairs = {index:node_id_centr_val[0] for index, node_id_centr_val in enumerate(sorted(centralities_dict.items(), key=lambda pair: -pair[1]))}
    result = [rank for rank, node_id in sorted(rank_node_id_pairs.items(), key=lambda pair: pair[1])]
    return np.array(result)

In [38]:
random_walk_direct = pp.processes.RandomWalk(highschool_direct)
stat_distr_direct = random_walk_direct.stationary_state()
stat_centralities_dict = dict(enumerate(stat_distr_direct))
rank_stat_centr_directed = rank_nodes(stat_centralities_dict)

In [39]:
in_degrees_dict = highschool_direct.indegrees()
in_degrees_ranks = rank_nodes(in_degrees_dict)

In [40]:
out_degrees = highschool_direct.outdegrees()
out_degrees_ranks = rank_nodes(out_degrees)

In [41]:
df = pd.DataFrame.from_dict({
    "in_degree vs stationary state": np.abs(rank_stat_centr_directed - in_degrees_ranks),
    "out_degree vs stationary state": np.abs(rank_stat_centr_directed - out_degrees_ranks)
})
df.head()

Unnamed: 0,in_degree vs stationary state,out_degree vs stationary state
0,50,19
1,55,55
2,13,30
3,38,3
4,1,9


In [42]:
fig = make_subplots()

fig.add_traces([
    go.Histogram(x=df["in_degree vs stationary state"], name="Delta with in_degree ranking", nbinsx=20),
    go.Histogram(x=df["out_degree vs stationary state"], name="Delta with out_degree ranking", nbinsx=20)
])
fig.show()


We can consider that differences with in_degree are much smaller then with out degree.

#### (b) Make the network undirected and compare the ranking of nodes based on stationary visitation probabilities to a degree-based ranking. Can you explain your observation in an analytical way?

In [43]:
highschool_undirect = pp.io.to_network(pp.io.to_dataframe(highschool_direct), directed=False, multiedges=False)
highschool_undirect.directed



False

In [44]:
random_walk_undirect = pp.processes.RandomWalk(highschool_undirect)
stat_distr_undirect = random_walk_undirect.stationary_state()
stat_centr_undirect_dict = dict(enumerate(stat_distr_undirect))
rank_stat_centr_undirected = rank_nodes(stat_centr_undirect_dict)

In [45]:
degrees_dict = highschool_undirect.degrees()
degrees_ranks = rank_nodes(degrees_dict)

In [46]:
df1 = pd.DataFrame.from_dict({
    "degree vs stationary state": np.abs(rank_stat_centr_undirected - degrees_ranks)
})
df1

Unnamed: 0,degree vs stationary state
0,2
1,39
2,17
3,5
4,39
...,...
65,12
66,15
67,18
68,49


In [47]:
fig = make_subplots()

fig.add_traces([
    go.Histogram(x=df1["degree vs stationary state"], name="Delta with degree ranking", nbinsx=70)
])
fig.show()

The stationary state is proportional relation to the degree in undirected network because $A_{ij} = A_{ji}$, but at the same time when we get directed network the equation is broken and stationary distribution is dependent on the in_degree.

### 3. Diffusion speed in networks

Consider microstates generated by the (one-dimensional) Watts-Strogatz model with parameters $n = 100$, $s = 3$, and three rewiring probabilities $p = 0$, $p = 0.1$, and $p = 1$.

#### (a) Use the transition matrix to compute the stationary distribution $\vec{\pi}$ of a random walk on these three microstates.
Hint: use numpy or scipy functions to calculate eigenvectors

In [62]:
def trans_matrix(network: pp.Network, weight=True):
    adj_matrix = pp.algorithms.adjacency_matrix(network, weight=weight)
    diags = sp.sparse.diags(1/adj_matrix.sum(axis=1).A.ravel())
    return (diags @ adj_matrix).toarray()

def stationary_state(network: pp.Network, weight=True):
    t_matrix = trans_matrix(network, weight)
    eigenvalues, eigenvectors = sp.linalg.eig(t_matrix, left=True, right=False)

    pi = eigenvectors[:, np.argsort(-eigenvalues)][:, 0]

    return pi / sum(pi)

In [63]:
n = pp.generators.Watts_Strogatz(30, 5, 0.1)
stationary_state(n)

array([0.03333333, 0.03666667, 0.03      , 0.03333333, 0.03666667,
       0.03333333, 0.03666667, 0.03333333, 0.03666667, 0.03      ,
       0.03333333, 0.03333333, 0.03333333, 0.03333333, 0.03      ,
       0.03      , 0.03666667, 0.03333333, 0.03      , 0.03      ,
       0.03      , 0.03333333, 0.03666667, 0.03666667, 0.03333333,
       0.03666667, 0.02333333, 0.03333333, 0.04      , 0.03333333])

In [74]:
n = 100
s = 3
p_values = [0, 0.1, 1]

In [75]:
def gen_netw(p: float):
    netw = pp.generators.Watts_Strogatz(n, s, p)
    return stationary_state(netw)

In [76]:
with mp.Pool(mp.cpu_count()) as pool:
    results = np.array([i for i in tqdm(pool.imap_unordered(gen_netw, p_values), total=len(p_values), desc="generate options")])

generate options: 100%|██████████| 3/3 [00:00<00:00, 35.26it/s]


In [80]:
df = pd.DataFrame(
    results.T,
    columns=["p = 0", "p = 0.1", "p = 1"]
)
df.head()

Unnamed: 0,p = 0,p = 0.1,p = 1
0,0.008333+0.000000j,0.01-0.00j,0.005000+0.000000j
1,0.010000+0.000000j,0.01-0.00j,0.008333+0.000000j
2,0.008333+0.000000j,0.01-0.00j,0.006667+0.000000j
3,0.011667+0.000000j,0.01-0.00j,0.010000+0.000000j
4,0.010000+0.000000j,0.01-0.00j,0.008333+0.000000j


#### (b) Assume that $\vec{\pi}^{(0)}$ is a random initial distribution of a random walk, where we assign probability one to a randomly chosen node. We further assume that $\vec{\pi}^{(t)}$ denotes the visitation probabilities of nodes after t steps. Write a python function that computes the total variation distance between two distributions $\vec{\pi}$ and $\vec{\pi}^{(t)}$. Use your function to calculate the total variation distances $\vec{\pi} - \vec{\pi}^{(t)}$ between the stationary distribution and the visitation probabilities after t steps for different values of t and 50 different random initial distributions. Repeat your experiment for the three microstates mentioned above and plot the evolution of the average total variation distance to the stationary distribution over time. In which network is the speed of convergence to the stationary distribution largest? Which one exhibits the slowest convergence speed?

In [120]:
def tvd(p1, p2):
    assert len(p1) == len(p2)
    tvd = 0
    for i in range(len(p1)):
        tvd += abs(p1[i] - p2[i])
    return tvd/2

def visitation_probabilities(network, initial_dist, t):
    T = trans_matrix(network)
    p_t = np.dot(initial_dist, np.linalg.matrix_power(T, t))
    return np.squeeze(np.asarray(p_t))

def get_netw_visit_probs(sample_idx: int, p: float, max_t: int):
    new_netw = pp.generators.Watts_Strogatz(n, s, p)
    stationary_distribution = stationary_state(new_netw)
    start_distribution = np.zeros(new_netw.number_of_nodes())
    start_distribution[np.random.randint(low=0, high=new_netw.number_of_nodes() - 1)] = 1
    return np.array([tvd(stationary_distribution, visitation_probabilities(new_netw, start_distribution, every_time)) for every_time in range(1, max_t)])

def experiment(max_time: int, p_value: float, samples=50):

    gen_netw_func = partial(get_netw_visit_probs, p=p_value, max_t=max_time)

    with mp.Pool(mp.cpu_count()) as pool:
        results_to_mean = np.array([i for i in tqdm(pool.imap_unordered(gen_netw_func, range(samples)), total=samples, desc=f"Generate options for p = {p_value}")])

    return np.mean(results_to_mean, axis=0)

In [121]:
t = 100

In [122]:
final_results = np.array([experiment(t, every_p) for every_p in tqdm(p_values, desc="Run experiment for each p value")])



Run experiment for each p value:   0%|          | 0/3 [00:00<?, ?it/s][A


Generate options for p = 0:   0%|          | 0/50 [00:00<?, ?it/s][A[A[A


Generate options for p = 0:   2%|▏         | 1/50 [00:04<03:20,  4.09s/it][A[A[A


Generate options for p = 0:   8%|▊         | 4/50 [00:04<00:38,  1.21it/s][A[A[A


Generate options for p = 0:  14%|█▍        | 7/50 [00:04<00:17,  2.45it/s][A[A[A


Generate options for p = 0:  24%|██▍       | 12/50 [00:04<00:07,  5.20it/s][A[A[A


Generate options for p = 0:  30%|███       | 15/50 [00:05<00:06,  5.07it/s][A[A[A


Generate options for p = 0:  34%|███▍      | 17/50 [00:08<00:16,  1.97it/s][A[A[A


Generate options for p = 0:  38%|███▊      | 19/50 [00:08<00:12,  2.47it/s][A[A[A


Generate options for p = 0:  42%|████▏     | 21/50 [00:08<00:09,  3.18it/s][A[A[A


Generate options for p = 0:  46%|████▌     | 23/50 [00:08<00:07,  3.61it/s][A[A[A


Generate options for p = 0:  50%|█████     | 25/50 [00:09<00:05, 

In [123]:
final_df = pd.DataFrame(
    final_results.astype("float").T,
    columns=["p = 0", "p = 0.1", "p = 1"]
)
final_df

Unnamed: 0,p = 0,p = 0.1,p = 1
0,0.940000,0.941733,9.504667e-01
1,0.870000,0.836467,7.539524e-01
2,0.820741,0.702015,4.141312e-01
3,0.786852,0.632576,3.120223e-01
4,0.764475,0.565911,1.943969e-01
...,...,...,...
94,0.265364,0.001926,2.316600e-14
95,0.262883,0.001830,1.676364e-14
96,0.260425,0.001740,1.254205e-14
97,0.257991,0.001653,9.120916e-15


In [124]:
fig = make_subplots()
fig.add_traces(
    [
        go.Scatter(x=final_df.index, y=final_df["p = 0"], name="rewire probability p = 0"),
        go.Scatter(x=final_df.index, y=final_df["p = 0.1"], name="rewire probability p = 0.1"),
        go.Scatter(x=final_df.index, y=final_df["p = 1"], name="rewire probability p = 1")
    ]
)
fig.show()

Fot the network with rewire probability $p = 1$ the diffusion speed is the highest and for the network with the rewire probability $p = 0$ the diffusions speed is the lowest