In [1]:
import numpy as np
import pandas as pd
import networkx as nx
import matplotlib.pyplot as plt
import cdt
import os

No GPU automatically detected. Setting SETTINGS.GPU to 0, and SETTINGS.NJOBS to cpu_count.


Before comparing graphs, a single graph-learning algorithm needs to be selected. There are two ways of doing this:

1. Construct a gene-gene interaction network from the KEGG, TRRUST and CTD databases and compare it with the graphs learned for healthy controls by different algorithms. The algorithm whose graph is closest, under some distance measure, to the CTD graph should be selected.

2. Pick the algorithm that is most meaningful for the kind of data from which the graphs are being learned.

In this script, the focus is on the graphs generated using the PC algorithm, which was selected using point 2.
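The first approach in the list above can be sketched as follows. This is only a minimal illustration: the gene names, the edges and the algorithm labels (`algo_A`, `algo_B`) are hypothetical placeholders, and counting edges that appear in exactly one graph is an assumed choice of "closeness", not one the notebook commits to.

```python
# Toy edge sets; every name and edge here is a hypothetical placeholder.
reference_edges = {("g1", "g2"), ("g2", "g3"), ("g3", "g4")}

# Graphs learned for healthy controls by two made-up algorithms.
candidate_edges = {
    "algo_A": {("g1", "g2"), ("g2", "g3"), ("g4", "g3")},
    "algo_B": {("g1", "g2"), ("g2", "g3"), ("g3", "g4"), ("g1", "g4")},
}


def edge_disagreement(edges_a, edges_b):
    """Number of directed edges present in exactly one of the two graphs."""
    return len(edges_a ^ edges_b)


# Distance of each candidate graph from the reference network.
distances = {
    name: edge_disagreement(reference_edges, edges)
    for name, edges in candidate_edges.items()
}

# The candidate with the smallest distance would be selected.
best_algorithm = min(distances, key=distances.get)
print(distances, best_algorithm)
```

In this toy case `algo_A` reverses one reference edge and `algo_B` adds one extra edge, so `algo_B` ends up closer to the reference.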

In [2]:
path = os.getcwd() + '/causal graphs/'
xls = pd.ExcelFile(path + 'PC_graphs.xlsx')

# One sheet per condition; each sheet holds a gene-by-gene adjacency matrix.
pc_graphs = {}
for sheet in xls.sheet_names:
    # Reuse the already opened workbook and take the first column (gene names) as the index.
    adjacency_df = pd.read_excel(xls, sheet_name = sheet, index_col = 0)
    # Rows are edge sources and columns are edge targets.
    pc_graphs[sheet] = nx.from_pandas_adjacency(adjacency_df, create_using=nx.DiGraph())

## Structural Hamming distance

In this section, the structural Hamming distance (SHD) is calculated for every pair of conditions, with emphasis on each condition's distance from the "healthy" control network. The structural intervention distance (SID) is computed alongside it.
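To make concrete what the structural Hamming distance counts, here is a small hand-rolled sketch on edge sets. It is not the `cdt` implementation; the function name and the toy graphs are assumptions, and it assumes unweighted directed graphs over the same nodes without self-loops. The `double_for_anticausal` argument is meant to mirror the flag of the same name in `cdt.metrics.SHD`.

```python
def structural_hamming_distance(target_edges, pred_edges, double_for_anticausal=True):
    """SHD between two directed graphs given as sets of (source, target) pairs.

    Assumes unweighted graphs over the same nodes with no self-loops.
    """
    # Directed edges that are present in only one of the two graphs.
    differing = target_edges ^ pred_edges
    if double_for_anticausal:
        # A reversed edge contributes twice: once as missing, once as extra.
        return len(differing)
    # Otherwise each pair of nodes with any disagreement counts once.
    return len({frozenset(edge) for edge in differing})


# Toy graphs: pred reverses A -> B and adds an extra edge A -> C.
target = {("A", "B"), ("B", "C")}
pred = {("B", "A"), ("B", "C"), ("A", "C")}

shd_doubled = structural_hamming_distance(target, pred)
shd_single = structural_hamming_distance(target, pred, double_for_anticausal=False)
print(shd_doubled, shd_single)  # 3 2
```

The reversed edge accounts for two of the three errors when it is doubled and for one of the two errors when it is not.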

In [3]:
pc_graphs.keys()

dict_keys(['Adenovirus_Simplex_virus', 'Dengue', 'Influenza', 'Paraflu_RespSyncytial', 'Pneumonia', 'Rhinovirus', 'healthy_ctrl', 'Critical', 'Non-critical'])

In [13]:
from cdt.metrics import SHD, SID

conditions = list(pc_graphs.keys())
hamming_df = pd.DataFrame(columns = conditions, index = conditions)
intervention_df = pd.DataFrame(columns = conditions, index = conditions)

# Cell [idx, col] uses pc_graphs[idx] as the target graph and pc_graphs[col] as the prediction.
for col in conditions:
    for idx in conditions:
        # Count a reversed edge as two errors (one missing edge plus one extra edge).
        hamming_df.loc[idx, col] = SHD(pc_graphs[idx], pc_graphs[col],
                                       double_for_anticausal = True)
        # SID is not symmetric, so the argument order matters.
        intervention_df.loc[idx, col] = SID(pc_graphs[idx], pc_graphs[col])

In [15]:
intervention_df

,Adenovirus_Simplex_virus,Dengue,Influenza,Paraflu_RespSyncytial,Pneumonia,Rhinovirus,healthy_ctrl,Critical,Non-critical
Adenovirus_Simplex_virus,120.0,120.0,120.0,120.0,120.0,120.0,120.0,120.0,120.0
Dengue,160.0,160.0,160.0,160.0,160.0,160.0,160.0,160.0,160.0
Influenza,2660.0,2659.0,2507.0,2657.0,2656.0,2657.0,2659.0,2659.0,2657.0
Paraflu_RespSyncytial,152.0,152.0,152.0,152.0,152.0,152.0,152.0,152.0,152.0
Pneumonia,174.0,174.0,174.0,174.0,174.0,174.0,174.0,174.0,174.0
Rhinovirus,168.0,168.0,168.0,168.0,168.0,165.0,168.0,168.0,168.0
healthy_ctrl,136.0,136.0,136.0,136.0,136.0,136.0,136.0,136.0,136.0
Critical,1508.0,1494.0,1543.0,1533.0,1521.0,1490.0,1510.0,1032.0,1526.0
Non-critical,222.0,222.0,222.0,222.0,222.0,222.0,222.0,222.0,222.0
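
Since the emphasis is on distance from the healthy control network, one way to read the table is to rank the conditions by their entry in the `healthy_ctrl` column. The sketch below does this with values copied by hand from that column; the variable names are illustrative only.

```python
# Values copied from the healthy_ctrl column of the table above
# (the healthy_ctrl row itself is left out).
sid_vs_healthy = {
    "Adenovirus_Simplex_virus": 120.0,
    "Dengue": 160.0,
    "Influenza": 2659.0,
    "Paraflu_RespSyncytial": 152.0,
    "Pneumonia": 174.0,
    "Rhinovirus": 168.0,
    "Critical": 1510.0,
    "Non-critical": 222.0,
}

# Sort conditions from smallest to largest value.
ranked = sorted(sid_vs_healthy.items(), key=lambda item: item[1])
for condition, value in ranked:
    print(f"{condition:<26}{value:>8.0f}")
```

Inside the notebook, the same ordering could presumably be obtained directly with `intervention_df['healthy_ctrl'].drop('healthy_ctrl').sort_values()`.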
