# Data preparation – interdependency networks

Prepared by Omar A. Guerrero (oguerrero@turing.ac.uk, @guerrero_oa)

In the literature on the Sustainable Development Goals (SDGs), much attention has been given to interdependency networks between SDGs, targets, or development indicators. One of the features of PPI is its ability to take such networks into account as an exogenous variable meant to preserve certain structure in the co-movement of the indicators. The network is considered exogenous because it is a stylised fact of the system under study, not a causal account of the relationships between the indicators. While many studies attempt to make causal claims from such objects, we have shown (in the book and in multiple publications) that such statements cannot be causal (see https://doi.org/10.1016/j.im.2020.103342 for an example). Thus, the aim of this tutorial is simply to show how to prepare the data for the network input of PPI.

In the book, as in most of PPI's studies, we have employed a method called `sparsebn` (see http://doi.org/10.18637/jss.v091.i11). However, for the sake of simplicity in these tutorials, let us employ a simple correlation approach to construct the network. First, we will load the clean indicator data. Then, we will estimate pairwise correlations between the changes in two time series, with one of them lagged by one period. This allows us to construct a directed, asymmetric network. Next, we will filter out edges using an arbitrary threshold criterion. Finally, we will structure the data and export it.

## Import the necessary Python libraries to manipulate data

In [1]:
import pandas as pd
import numpy as np

## Import the clean development indicators

In [2]:
data = pd.read_csv('https://raw.githubusercontent.com/oguerrer/ppi/main/tutorials/clean_data/data_indicators.csv')
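The cells that follow rely on two features of this file that can be read off the code: an identifier column called `seriesCode`, and one column per year whose name is purely numeric. As a minimal sketch, the toy frame below has the same layout and shows how the year columns can be picked out. Its values and the `ind_a`/`ind_b` ids are made up for illustration and are not taken from the real file.

```python
import pandas as pd

# Toy stand-in for the indicator file: hypothetical ids and values,
# one row per indicator and one column per year.
toy = pd.DataFrame({
    'seriesCode': ['ind_a', 'ind_b'],
    '2000': [0.10, 0.50],
    '2001': [0.20, 0.40],
    '2002': [0.40, 0.45],
})

# Keep only the columns whose names are purely numeric (the years).
toy_years = [c for c in toy.columns if str(c).isnumeric()]

# Numeric matrix with one row per indicator and one column per year.
toy_series = toy[toy_years].values.astype(float)

print(toy_years)
print(toy_series.shape)
```

Running the same selection on `data` is a quick way to check that the year columns were detected as expected before computing any correlations.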

## Construct a matrix with pairwise Pearson correlations

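Entry `M[i, j]` holds the correlation between the changes in indicator `i` and the one-period-lagged changes in indicator `j`. The directionality of the edges is from row to column.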

In [31]:
N = len(data)
M = np.zeros((N, N))
# year columns are the ones whose names are purely numeric
years = [column_name for column_name in data.columns if str(column_name).isnumeric()]

for i, rowi in data.iterrows():
    for j, rowj in data.iterrows():
        if i != j:
            # indicator i from the second year onwards, indicator j lagged by one period
            serie1 = rowi[years].values.astype(float)[1:]
            serie2 = rowj[years].values.astype(float)[:-1]
            # first differences (year-on-year changes)
            change_serie1 = serie1[1:] - serie1[:-1]
            change_serie2 = serie2[1:] - serie2[:-1]
            # skip pairs where either series of changes is constant (correlation undefined)
            if not np.all(change_serie1 == change_serie1[0]) and not np.all(change_serie2 == change_serie2[0]):
                M[i, j] = np.corrcoef(change_serie1, change_serie2)[0, 1]
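To see what a single entry of the matrix measures, the sketch below applies the same slicing to two synthetic series. The helper name `lagged_change_corr` and the series `x` and `y` are hypothetical and only serve the illustration. Here `y` is built as a copy of `x` delayed by one period, so correlating the changes in `y` with the lagged changes in `x` gives a perfect correlation, while swapping the arguments does not.

```python
import numpy as np

def lagged_change_corr(current, lagged):
    """Correlate changes of `current` with one-period-lagged changes of `lagged`.

    Mirrors the slicing used in the loop: `current` plays the role of
    serie1 and `lagged` the role of serie2. Hypothetical helper.
    """
    s1 = np.asarray(current, dtype=float)[1:]   # from the second period onwards
    s2 = np.asarray(lagged, dtype=float)[:-1]   # shifted back by one period
    d1 = np.diff(s1)                            # period-on-period changes
    d2 = np.diff(s2)
    # A constant series of changes has no defined correlation; return 0 (no edge).
    if np.all(d1 == d1[0]) or np.all(d2 == d2[0]):
        return 0.0
    return float(np.corrcoef(d1, d2)[0, 1])

# Synthetic data: y repeats x with a delay of one period.
x = np.array([0, 1, 3, 2, 5, 4, 7], dtype=float)
y = np.concatenate(([0.0], x[:-1]))

r_yx = lagged_change_corr(y, x)  # changes in y against lagged changes in x
r_xy = lagged_change_corr(x, y)  # changes in x against lagged changes in y

print(r_yx, r_xy)
```

The two values differ, which is why the resulting matrix is asymmetric and can be read as a directed network.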

## Filter edges that have a weight of magnitude lower than 0.5

In [32]:
M[np.abs(M) < 0.5] = 0
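The effect of the threshold can be checked on a small matrix. The sketch below uses made-up weights in a hypothetical array `W` and also computes the share of possible directed edges that survive the filter. That share is only an illustrative diagnostic, not part of the original workflow.

```python
import numpy as np

# Made-up 3x3 weight matrix with a zero diagonal (no self-loops).
W = np.array([[0.00, 0.80, -0.30],
              [0.45, 0.00, -0.60],
              [0.10, 0.50, 0.00]])

# Same filter as above: drop weights whose magnitude is below 0.5.
W_filtered = W.copy()
W_filtered[np.abs(W_filtered) < 0.5] = 0

n = W.shape[0]
n_edges = np.count_nonzero(W_filtered)
# Share of the n * (n - 1) possible directed edges that remain.
density = n_edges / (n * (n - 1))

print(n_edges, density)
```

Because the comparison is strict, a weight of exactly 0.5 in magnitude is kept, and negative correlations survive as long as their magnitude passes the threshold.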

## Save the network as a list of edges using the indicators' ids

In [5]:
ids = data.seriesCode.values
edge_list = []
for i, j in zip(np.where(M!=0)[0], np.where(M!=0)[1]):
    edge_list.append( [ids[i], ids[j]] )
df = pd.DataFrame(edge_list, columns=['origin', 'destination'])
df.to_csv('clean_data/data_network.csv', index=False)
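As a quick check of the output format, the sketch below builds an edge list from a toy matrix and writes it to an in-memory buffer instead of a file. The ids and weights are hypothetical, and `np.nonzero` is used here as an equivalent way of collecting the row and column positions of the non-zero entries.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical indicator ids and an already filtered toy matrix.
toy_ids = np.array(['ind_a', 'ind_b', 'ind_c'])
toy_M = np.array([[0.0, 0.8, 0.0],
                  [0.0, 0.0, -0.6],
                  [0.0, 0.5, 0.0]])

# Row and column positions of the surviving edges (row -> column).
rows, cols = np.nonzero(toy_M)
toy_edges = pd.DataFrame({'origin': toy_ids[rows],
                          'destination': toy_ids[cols]})

# Write to an in-memory buffer and read it back to inspect the CSV layout.
buffer = io.StringIO()
toy_edges.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

print(reloaded)
```

Each row of the file is one directed edge, with the row indicator as `origin` and the column indicator as `destination`.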