## BST 267: Introduction to Social and Biological Networks (2022)

## Final Project

### Overview

In this final project we will use network data from a survey of social networks in 75 villages in rural southern Karnataka, India. This dataset was originally collected to study diffusion of microfinance. The details of the original publication are: Abhijit Banerjee, Arun G. Chandrasekhar, Esther Duflo, Matthew O. Jackson. The Diffusion of Microfinance. Science, Vol. 341 no. 6144, 2013.

We will use the network data to investigate the spread of a fictitious pathogen. We will assume that these social networks correspond to contact networks for the given pathogen. In other words, if any two individuals are connected in the network, then the pathogen may spread between them.

### Question 1: Downloading and reading in village data

Your first task is to download the dataset from Harvard Dataverse. We'll work with the latest version of the dataset, which is available here:
https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/U3BIHX&version=9.4

Proceed to download the data to your computer. You will be asked a couple of questions. One of them is: "What is the intended use of the files that you are requesting to download?" You should select the following option: "Student: completing an exercise or problem set designed for this dataset".

The downloaded file is a compressed zip file, so you need to unzip it first and place it in **the same directory where this notebook is located** on your computer. The folder of interest to us is called `Data`. Its subfolder `1. Network Data` contains adjacency matrices as CSV files for each village. The adjacency matrices themselves are in the folder `Adjacency Matrices`. The main folder comes with a file called `README.pdf` which you should consult as needed to get a better understanding of the data.

The code for reading in network adjacency matrices and for constructing the network objects is below. In this question, you need to do two things: 1) add one comment line preceding each line of code (apart from the import statements) to clarify what each line does; and 2) run the code.

Note: Different operating systems have slightly different ways of specifying file paths. The code below runs as is if you're on Mac; if you're a Windows or Linux user, you may need to modify the line specifying the path.
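One way to sidestep operating-system differences is to build the path with `pathlib` from the Python standard library. The snippet below is only a sketch: it assumes the unzipped folder is named `datav4.0`, as in the code cell below, and the helper name `adjacency_file` is hypothetical.

```python
from pathlib import Path

# Assumption: the unzipped dataset folder sits next to this notebook.
data_dir = Path(".") / "datav4.0" / "Data" / "1. Network Data" / "Adjacency Matrices"

def adjacency_file(village):
    """Return the path of the adjacency matrix CSV for one village (hypothetical helper)."""
    return data_dir / f"adj_allVillageRelationships_vilno_{village}.csv"

print(adjacency_file(1))
```

`np.loadtxt` accepts `Path` objects, so the result can be passed to it directly.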

In [1]:
# RUN THIS CODE AND ADD COMMENTS

import networkx as nx
import numpy as np
import matplotlib.pyplot as plt
%matplotlib notebook

path = "./datav4.0/Data/1. Network Data/Adjacency Matrices/"
villages = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 14, 15, 16, 17, 18, 19, 20, 21,
            23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40,
            41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58,
            59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77]

Gs = {}
for k in villages:
    filename = path + "adj_allVillageRelationships_vilno_" + str(k) + ".csv"
    A = np.loadtxt(filename, delimiter=",")
    G = nx.to_networkx_graph(A)
    Gs[k] = G

### Question 2: Extracting largest connected components

Most empirical networks have a largest connected component (LCC) that contains most of the network's nodes. Create a dictionary `LCCs` that consists of the largest connected components (graph objects) of the networks in the `Gs` dictionary. (You may find `nx.induced_subgraph` to be useful in this problem.) Make sure that the keys match across the two dictionaries. Create a dot plot where the y-axis shows the proportion of nodes in the LCC of each graph and the x-axis corresponds to the index of the village. (The village index runs from 1 to 77, but the list above skips villages 13 and 22.) Order the villages on the x-axis so that the y-axis values increase from left to right.
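As a rough illustration of the idea on a toy graph (not the village data), the LCC can be found by taking the largest node set returned by `nx.connected_components` and inducing a subgraph on it. The toy edges and the made-up proportions in `props` are assumptions chosen purely for the example.

```python
import networkx as nx

# Toy graph with two components: a path on 4 nodes and a single edge.
G = nx.Graph()
G.add_edges_from([(0, 1), (1, 2), (2, 3), (10, 11)])

# nx.connected_components yields one set of nodes per component.
largest_nodes = max(nx.connected_components(G), key=len)

# Copy the induced subgraph so that it is an independent graph, not a view.
lcc = nx.induced_subgraph(G, largest_nodes).copy()

# Proportion of all nodes that fall inside the LCC.
proportion = lcc.number_of_nodes() / G.number_of_nodes()
print(sorted(lcc.nodes()), proportion)

# Ordering dictionary keys by their values (made-up proportions).
props = {1: 0.9, 2: 0.7, 3: 0.8}
order = sorted(props, key=props.get)  # [2, 3, 1]
```

Sorting the keys by value, as in the last two lines, is one way to obtain an x-axis ordering in which the plotted proportions increase from left to right.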

In [1]:
# ADD YOUR CODE HERE


### Question 3: Visualizing networks and their degree distributions

Visualize the LCC (graph) of each network in a single figure using a layout consisting of 13x6 panels. Using `plt.figure(figsize=(20,40))` should look reasonable.

In a separate figure, in a single panel, plot the degree distribution of each LCC. To make the plot more readable, you should use a line plot and not a histogram.
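A minimal sketch of how a degree distribution could be computed for a line plot, shown on a toy path graph rather than a village LCC. It relies on `nx.degree_histogram`, which returns a list whose entry `k` is the number of nodes with degree `k`; normalizing by the number of nodes is an assumption about what "degree distribution" should mean here.

```python
import networkx as nx

# Toy graph: a path on 5 nodes, so the degrees are 1, 2, 2, 2, 1.
G = nx.path_graph(5)

# counts[k] is the number of nodes with degree k.
counts = nx.degree_histogram(G)

# Normalize the counts so that the values sum to 1.
n = G.number_of_nodes()
pk = [c / n for c in counts]
print(pk)
```

Each LCC's `pk` could then be drawn with `plt.plot(range(len(pk)), pk)`, giving one line per village in a shared panel.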

In [2]:
# ADD YOUR CODE HERE


### Question 4: Simulating SIR process

Write code to simulate a discrete-time SIR spreading process on the LCCs of four villages only: 1, 31, 61, and 77. Set the S to I transition probability `p_si` to 0.2 and the I to R transition probability `p_ir` to 0.1. Initially only one individual, selected uniformly at random among the LCC nodes, is infected; the rest of the LCC is susceptible. Use **unit infectivity** in the SIR simulation: at each time step, each infected node selects exactly one of its network neighbors uniformly at random and potentially transmits the pathogen to it. Any node, regardless of its degree, will therefore infect 0 or 1 nodes per time step.

Run the simulation 300 times for each of the four LCCs. In four separate panels, one per village, plot the proportion of individuals in each of the three states (S, I, R) against time for each simulation run. Explain your findings.
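The unit-infectivity rule can be sketched as a single synchronous time step. This is one possible reading of the description, not a reference solution: the function name `sir_step`, the state labels, the adjacency-dict input (which could be built from a graph `G` with `{n: list(G[n]) for n in G}`), and the choice to test recovery in the same step as the transmission attempt are all assumptions.

```python
import random

def sir_step(neighbors, state, p_si, p_ir, rng):
    """Advance the SIR process by one time step and return the new state dict.

    neighbors: dict mapping each node to a list of its neighbors
    state: dict mapping each node to "S", "I", or "R"
    """
    new_state = dict(state)
    infected = [n for n, s in state.items() if s == "I"]
    for node in infected:
        # Unit infectivity: pick exactly one neighbor uniformly at random.
        if neighbors[node]:
            target = rng.choice(neighbors[node])
            # Transmission is only possible if that neighbor is susceptible.
            if state[target] == "S" and rng.random() < p_si:
                new_state[target] = "I"
        # Assumption: recovery is tested in the same step as the attempt above.
        if rng.random() < p_ir:
            new_state[node] = "R"
    return new_state

# Toy run on a path of 4 nodes with a randomly chosen seed node.
rng = random.Random(0)
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
state = {n: "S" for n in neighbors}
state[rng.choice(list(neighbors))] = "I"
history = [state]
while "I" in state.values():
    state = sir_step(neighbors, state, 0.2, 0.1, rng)
    history.append(state)
```

Reading from `state` and writing to `new_state` keeps the update synchronous, so a node infected during a step cannot transmit until the next step. The proportions of S, I, and R at each time can be read off `history`.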

In [3]:
p_si = 0.2 # S to I transition probability
p_ir = 0.1 # I to R transition probability

# ADD YOUR CODE HERE


### Question 5: Reproduction number

The reproduction number of an infectious disease can be succinctly summarized as **the number of persons infected per person infecting**. There are many variants of this concept. The first thing to note is that this number varies during the epidemic. The **basic reproduction number $R_0$** is usually defined as the expected number of cases generated by one case in a fully susceptible population. Because the population is fully susceptible only at the very beginning of an epidemic, $R_0$ may be less informative about what happens later, especially for a network (vs. mass-action) model. The **effective reproduction number $R_t$** is usually defined as the expected number of cases generated by one case at time $t$ in a partially susceptible population. At time $t=0$, $R_t$ coincides with $R_0$.

Use simulation to compute (estimate) the effective reproduction number $R_t$ as a function of time for the SIR process for the LCC of village 1 and the LCC of village 31 using at least 1000 spreading process realizations for each. Make a plot with three panels: (1) plot $R_t$ for each realization over time and the mean $R_t$ over time for village 1 (left panel); (2) same for village 31 (middle panel); and (3) plot the mean $R_t$ for the two villages (two curves) over time (right panel).
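The text leaves the exact estimator open. One possible choice, sketched below, is to record who infected whom during a realization and then average the number of secondary cases over all nodes that became infected at the same time step. The function name `cohort_rt`, the two bookkeeping dictionaries, and this cohort-style definition are assumptions; the toy infection records are made up.

```python
from collections import defaultdict

def cohort_rt(infection_time, infector):
    """Mean number of secondary cases caused by the nodes infected at each time t.

    infection_time: dict node -> time step at which the node became infected
    infector: dict node -> node that infected it (the seed node has no entry)
    """
    # Count how many cases each node caused over the whole realization.
    secondary = defaultdict(int)
    for child, parent in infector.items():
        secondary[parent] += 1
    # Group those counts by the time at which the infecting node was infected.
    by_time = defaultdict(list)
    for node, t in infection_time.items():
        by_time[t].append(secondary[node])
    return {t: sum(v) / len(v) for t, v in sorted(by_time.items())}

# Made-up realization: "a" is the seed, infects "b" and "c"; "b" infects "d".
infection_time = {"a": 0, "b": 1, "c": 1, "d": 2}
infector = {"b": "a", "c": "a", "d": "b"}
rt = cohort_rt(infection_time, infector)
print(rt)  # {0: 2.0, 1: 0.5, 2: 0.0}
```

Time steps at which nobody was infected have no entry in the result, so averaging across realizations needs to account for such gaps.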

In [4]:
# ADD YOUR CODE HERE


### Question 6: Individual-level randomized controlled vaccine trial

In this question, you will implement an individual-level randomized trial to assess the efficacy of a new vaccine against an infectious disease. In these trials, every individual in every village is randomized to treatment with probability 1/2 and to control with probability 1/2. On average, half of the individuals in each village will be assigned to treatment and the other half to control.

We will assume that the true S to I transmission probabilities for treated and untreated (control) individuals are known and they are `p_si_treatment` and `p_si_control`, respectively. We will assume that the I to R probability for recovery is known and is given by `p_ir`.

Write the code to simulate this clinical trial. You will need to randomize each individual in every village to treatment or control and then propagate the SIR process on each village network independently until it naturally comes to a halt. The output for each village should be the proportion of cases in the treatment group (i.e., proportion of nodes in the treatment group that were infected at some point), the proportion of cases in the control group, and the difference between the two. Your code should print out three numbers: (1) the average proportion of cases in the treatment groups (where average is taken across all villages); (2) the average proportion of cases in the control groups (where average is taken across all villages); and (3) proportion of cases in the control group - proportion of cases in the treatment group (i.e., difference in two proportions), averaged over all villages.
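The randomization and the per-village summary could look roughly like the sketch below, which leaves out the SIR simulation itself. The helper names are hypothetical, and the sketch assumes that a susceptible node's own arm determines its S to I probability, which is one reading of the description above.

```python
import random

p_si_treatment = 0.05  # values taken from the cell below
p_si_control = 0.3

def randomize_individuals(nodes, rng):
    """Independently assign each node to treatment (True) with probability 1/2."""
    return {n: rng.random() < 0.5 for n in nodes}

def case_proportions(treated, ever_infected):
    """Return (treatment proportion, control proportion, control minus treatment).

    Assumes both arms are non-empty, which is all but certain for a village LCC.
    """
    t_nodes = [n for n, is_t in treated.items() if is_t]
    c_nodes = [n for n, is_t in treated.items() if not is_t]
    prop_t = sum(n in ever_infected for n in t_nodes) / len(t_nodes)
    prop_c = sum(n in ever_infected for n in c_nodes) / len(c_nodes)
    return prop_t, prop_c, prop_c - prop_t

rng = random.Random(42)
assignment = randomize_individuals(range(200), rng)

# Assumption: the susceptible node's own arm sets its S to I probability.
p_si = {n: p_si_treatment if is_t else p_si_control
        for n, is_t in assignment.items()}

# Hand-made outcome for illustration: nodes 0 and 1 treated, 2 and 3 control.
toy_assignment = {0: True, 1: True, 2: False, 3: False}
toy_cases = {1, 2, 3}
print(case_proportions(toy_assignment, toy_cases))  # (0.5, 1.0, 0.5)
```

Whether the initially infected node counts as a case in its arm is a design choice that should be stated alongside the results.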

Note: To simplify implementation, in this question "village" refers to the LCC of the village.

In [5]:
# fix some parameters
p_ir = 0.05
p_si_treatment = 0.05
p_si_control = 0.3

# ADD YOUR CODE HERE


### Question 7: Cluster-level randomized controlled vaccine trial

In general, the **direct effect** of a vaccine refers to the protection against illness received by an individual because they themselves received the vaccine. In contrast, the **indirect effect** of a vaccine refers to the protection against illness received by an individual because others around them received the vaccine. When vaccines are rolled out at scale, we benefit from the combination of the direct and indirect effects, which is often called the **total effect**.

Cluster-randomized trials are used to study interventions for infectious diseases. The rationale for cluster-level vs. individual-level randomization is that the former can capture both direct and indirect effects of the intervention; this setting more closely corresponds to what would happen if the intervention were rolled out at scale.

In this question, you will implement a cluster-level randomized trial, also called a cluster-randomized trial. In these trials, each village is randomized to treatment with probability 1/2 and to control with probability 1/2. The key point is that everyone in a given village receives the same treatment.

Write the code to simulate this clinical trial. Your code should print out three numbers: (1) the average proportion of cases in the treatment villages; (2) the average proportion of cases in the control villages; and (3) the difference between the two.
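A sketch of the village-level assignment and the arm-level averages, again without the SIR simulation. The helper names are hypothetical, the case proportions are made up, and reporting the difference as control minus treatment (to match Question 6) is an assumption.

```python
import random

def randomize_villages(village_ids, rng):
    """Independently assign each village to treatment (True) with probability 1/2."""
    return {k: rng.random() < 0.5 for k in village_ids}

def arm_means(village_treated, attack_rate):
    """Average the village-level case proportions within each arm.

    Assumes each arm contains at least one village.
    """
    t = [attack_rate[k] for k, is_t in village_treated.items() if is_t]
    c = [attack_rate[k] for k, is_t in village_treated.items() if not is_t]
    mean_t = sum(t) / len(t)
    mean_c = sum(c) / len(c)
    # Assumption: difference reported as control minus treatment.
    return mean_t, mean_c, mean_c - mean_t

rng = random.Random(7)
village_treated = randomize_villages([1, 2, 3, 4, 5, 6], rng)

# Made-up case proportions and a hand-made assignment, for illustration only.
toy_treated = {1: True, 2: False, 3: True, 4: False}
toy_attack = {1: 0.1, 2: 0.9, 3: 0.3, 4: 0.7}
mean_t, mean_c, diff = arm_means(toy_treated, toy_attack)
print(mean_t, mean_c, diff)
```

Because everyone in a village shares one arm, every node in that village uses the same S to I probability during the simulation.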

Provide a brief comment about the difference between the results in Question 6 and Question 7.

Note: To simplify implementation, in this question "village" refers to the LCC of the village.

In [6]:
# fix some parameters
p_ir = 0.05
p_si_treatment = 0.05
p_si_control = 0.3

# ADD YOUR CODE HERE
