# Practice Session 03: Management of network data

In this session we will study an application of complex network analysis to medicine. We will start with the *diseasome*, a bi-partite network connecting all known genetic diseases with the genes whose mutations are implicated in them [1].

The initial dataset `disease-genes.csv` in the data/ directory contains the following columns:

1. A disorder **ID**
2. A disorder **Name**
3. A comma-separated list of **Genes** involved in this disorder
4. The **OMIM ID** (Online Mendelian Inheritance in Man) of this disorder
5. A codification of the location of the genes in their **Chromosome**
6. A disorder **Class** indicating the physiological system that is affected

[1] Goh, K. I., Cusick, M. E., Valle, D., Childs, B., Vidal, M., & Barabási, A. L. (2007). [The human disease network](http://www.pnas.org/content/104/21/8685). Proceedings of the National Academy of Sciences, 104(21), 8685-8690.

Author: <font color="blue">Jose Giner</font>

E-mail: <font color="blue">joseginer67@gmail.com</font>

Date: <font color="blue">15/02/2022</font>

# 1. The diseasome bi-partite graph

In [5]:
# Feel free to add imports if you need them
import io
import csv
import pandas as pd

In [2]:
# Leave this code as-is

INPUT_FILENAME = "disease-genes.csv"
OUTPUT_DISEASOME_FILENAME = "diseasome.csv"

In [3]:
# Leave this code as-is

disease_genes = pd.read_csv('C:\\Users\\Jose Giner\\OneDrive\\Escritorio\\SNA\\P3\\' + INPUT_FILENAME, sep=",")
disease_genes.set_index("ID", inplace=True)
# Show the first ten rows to check that the data was loaded correctly.
disease_genes.head(10)

ID,Name,Genes,OMIM ID,Chromosome,Class
1,"17,20-lyase deficiency, isolated","CYP17A1, CYP17, P450C17",609300,10q24.3,Endocrine
1,"17-alpha-hydroxylase/17,20-lyase deficiency","CYP17A1, CYP17, P450C17",609300,10q24.3,Endocrine
3,2-methyl-3-hydroxybutyryl-CoA dehydrogenase de...,"HADH2, ERAB",300256,Xp11.2,Metabolic
4,2-methylbutyrylglycinuria,ACADSB,600301,10q25-q26,Metabolic
5,"3-beta-hydroxysteroid dehydrogenase, type II, ...",HSD3B2,201810,1p13.1,Metabolic
6,3-hydroxyacyl-CoA dehydrogenase deficiency,"HADHSC, SCHAD",601609,4q22-q26,Metabolic
7,3-Methylcrotonyl-CoA carboxylase 1 deficiency,"MCCC1, MCCA",609010,3q25-q27,Metabolic
7,3-Methylcrotonyl-CoA carboxylase 2 deficiency,"MCCC2, MCCB",609014,5q12-q13,Metabolic
8,"3-methylglutaconic aciduria, type I",AUH,600529,Chr.9,Metabolic
9,"3-methylglutaconicaciduria, type III","OPA3, MGA3",606580,19q13.2-q13.3,Metabolic


In [4]:
# Keep only the columns needed for the diseasome, under shorter names
diseasome = pd.DataFrame({
    'disorder': disease_genes['Name'],
    'class': disease_genes['Class'],
    'gene_list': disease_genes['Genes'],
})
diseasome.reset_index(inplace=True)  # move the "ID" index back into a regular column
diseasome.drop("ID", axis=1, inplace=True)  # drop the "ID" column, leaving a default integer index
diseasome.head(3)

,disorder,class,gene_list
0,"17,20-lyase deficiency, isolated",Endocrine,"CYP17A1, CYP17, P450C17"
1,"17-alpha-hydroxylase/17,20-lyase deficiency",Endocrine,"CYP17A1, CYP17, P450C17"
2,2-methyl-3-hydroxybutyryl-CoA dehydrogenase de...,Metabolic,"HADH2, ERAB"


In [46]:
# Build one (disorder, class, gene) tuple per disorder-gene link
nrows = diseasome.shape[0]
disorder_genes = []
for i in range(nrows):
    d = diseasome["disorder"].iloc[i]
    c = diseasome["class"].iloc[i]
    # Split the comma-separated gene string and trim surrounding whitespace
    gene_list = list(map(str.strip, diseasome["gene_list"].iloc[i].split(',')))
    for g in gene_list:
        disorder_genes.append((d, c, g))

# Unpack the tuples into one list per output column
disorders = [d[0] for d in disorder_genes]
dis_class = [d[1] for d in disorder_genes]
genes = [d[2] for d in disorder_genes]

df = pd.DataFrame({'disorder': disorders,
                   'class': dis_class,
                   'gene': genes})

In [54]:
df.head(3)

,disorder,class,gene
0,"17,20-lyase deficiency, isolated",Endocrine,CYP17A1
1,"17,20-lyase deficiency, isolated",Endocrine,CYP17
2,"17,20-lyase deficiency, isolated",Endocrine,P450C17
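
The row-by-row loop above can also be written with pandas' own string and reshaping methods. The following is a minimal sketch, not the notebook's method: it assumes a pandas version that provides `DataFrame.explode` (0.25 or later), and `toy` is a hypothetical two-row stand-in for the `diseasome` DataFrame.

```python
import pandas as pd

# Hypothetical two-row stand-in for the `diseasome` DataFrame built above.
toy = pd.DataFrame({
    "disorder": ["17,20-lyase deficiency, isolated", "2-methylbutyrylglycinuria"],
    "class": ["Endocrine", "Metabolic"],
    "gene_list": ["CYP17A1, CYP17, P450C17", "ACADSB"],
})

# Split each comma-separated string into a list, then emit one row per gene.
edges = (
    toy.assign(gene=toy["gene_list"].str.split(","))
       .explode("gene")
       .drop(columns="gene_list")
       .reset_index(drop=True)
)
# Remove the space left after each comma.
edges["gene"] = edges["gene"].str.strip()

print(edges)
```

The result has the same three columns (`disorder`, `class`, `gene`) as `df`, with one row per disorder-gene link.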


In [52]:
df.to_csv(OUTPUT_DISEASOME_FILENAME,sep = '\t', index = False)

![Diseasome graph](diseasome.png) 

In this graph, diseases are linked to genes and vice versa, and there are three or four main clusters joined by disorders that act as bridge nodes. There are also regions containing diseases of different types that are caused by the same genes. The largest component contains 274 of the 6386 nodes in the graph (4.3%). The dominant type of disease in the largest component is cancer, with 95 of its 132 disease nodes (approximately 72%). Diseases of the same type lie close to each other because they are associated with the same or similar genes.
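
Figures such as the size of the largest component can be cross-checked directly from the edge list. The following is a minimal sketch using a breadth-first search in plain Python; the function `connected_components` and the toy edges are hypothetical illustrations, and the sketch assumes that disorder names and gene names never coincide.

```python
from collections import defaultdict, deque


def connected_components(edges):
    """Return the components of an undirected graph as sets, largest first."""
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)

    seen = set()
    components = []
    for start in adjacency:
        if start in seen:
            continue
        # Breadth-first search from an unvisited node collects one component.
        seen.add(start)
        component = {start}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    component.add(neighbour)
                    queue.append(neighbour)
        components.append(component)
    return sorted(components, key=len, reverse=True)


# Hypothetical (disorder, gene) edges, in the same shape as the rows of `df`.
toy_edges = [
    ("disorder A", "GENE1"),
    ("disorder B", "GENE1"),
    ("disorder B", "GENE2"),
    ("disorder C", "GENE3"),
]
components = connected_components(toy_edges)
largest = components[0]
share = len(largest) / sum(len(c) for c in components)
print(len(largest), round(share, 3))
```

With the real data, the edges could be taken from the `disorder` and `gene` columns of `df`, for example with `zip(df["disorder"], df["gene"])`.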

# 2. The disease-disease graph

In [5]:
OUTPUT_DISEASEDISEASE_FILENAME = "disease-disease.csv"

In [6]:
def intersection(list1, list2):
    """Return the elements present in both lists (order not preserved)."""
    return list(set(list1) & set(list2))

In [7]:
# One list per output column of the disease-disease table
d1, d2 = [], []
ngen1, ngen2 = [], []
cl1, cl2 = [], []
com_gen = []
for idx1, disorder1 in diseasome.iterrows():
    gene_list_1 = list(map(str.strip, disorder1["gene_list"].split(',')))
    for idx2, disorder2 in diseasome.iterrows():
        # idx2 > idx1 visits each unordered pair once and skips self-pairs
        if idx2 > idx1:
            gene_list_2 = list(map(str.strip, disorder2["gene_list"].split(',')))
            common_genes = intersection(gene_list_1, gene_list_2)
            # Keep only pairs of disorders that share at least one gene
            if len(common_genes) > 0:
                d1.append(disorder1["disorder"])
                d2.append(disorder2["disorder"])
                ngen1.append(len(gene_list_1))
                ngen2.append(len(gene_list_2))
                cl1.append(disorder1['class'])
                cl2.append(disorder2['class'])
                com_gen.append(len(common_genes))

df2 = pd.DataFrame({'disorder1': d1,
                    'disorder2': d2,
                    'ngenes1': ngen1,
                    'ngenes2': ngen2,
                    'class1': cl1,
                    'class2': cl2,
                    'ngenescommon': com_gen})

In [17]:
df2.head(3)

,disorder1,disorder2,ngenes1,ngenes2,class1,class2,ngenescommon
0,"17,20-lyase deficiency, isolated","17-alpha-hydroxylase/17,20-lyase deficiency",3,3,Endocrine,Endocrine,3
1,"3-methylglutaconicaciduria, type III",Optic atrophy and cataract,2,2,Metabolic,Ophthamological,2
2,Aarskog-Scott syndrome,"Mental retardation, X-linked nonsyndromic",3,3,multiple,Neurological,3
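
The nested loop above compares every pair of disorders, which grows quadratically with the number of rows. A possible alternative, shown here only as a sketch, first builds an index from each gene to the disorders that mention it and then counts pairs within each gene's group. The function `shared_gene_pairs` and the toy data are hypothetical, and the sketch assumes that disorder names are unique (the notebook's loop works on row positions instead).

```python
from collections import defaultdict
from itertools import combinations


def shared_gene_pairs(disease_to_genes):
    """Map each unordered disease pair to the number of genes they share."""
    # Inverted index: gene -> set of diseases associated with it.
    gene_to_diseases = defaultdict(set)
    for disease, genes in disease_to_genes.items():
        for gene in genes:
            gene_to_diseases[gene].add(disease)

    # Every pair of diseases listed under the same gene shares that gene.
    shared = defaultdict(int)
    for diseases in gene_to_diseases.values():
        for pair in combinations(sorted(diseases), 2):
            shared[pair] += 1
    return dict(shared)


# Hypothetical toy input: disorder name -> set of gene symbols.
toy = {
    "disorder A": {"GENE1", "GENE2"},
    "disorder B": {"GENE1", "GENE2", "GENE3"},
    "disorder C": {"GENE3"},
    "disorder D": {"GENE4"},
}
pairs = shared_gene_pairs(toy)
print(pairs)
```

Only pairs that share at least one gene appear in the result, so `disorder D` produces no edge, matching the `len(common_genes) > 0` filter used above.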


In [18]:
df2.to_csv(OUTPUT_DISEASEDISEASE_FILENAME,sep = '\t', index = False)

![Largest connected component of diseases graph](disease-disease-largest-cc.png)

![Diseases graph](disease-disease.png)

In this graph we can observe groups (disconnected components) of disorders that share genes directly or indirectly. Interestingly, the largest component contains only 132 of the nodes in the graph (about 8%). In most of the small components, every pair of diseases shares at least one gene, so these components have all or almost all possible connections (they are complete or nearly complete subgraphs), and most of them contain diseases of the same type. As a result, the ratio of existing links to the maximum possible number of links (L to L max) seems to increase as components get smaller.
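
The L to L max ratio mentioned above can be computed per component. The following is a minimal sketch, assuming an undirected graph without self-loops or repeated links, for which a component with n nodes has at most n(n - 1)/2 links; the function name `density` and the example values are illustrative only.

```python
def density(n_nodes, n_links):
    """Ratio L / L_max for an undirected simple graph with n_nodes nodes."""
    if n_nodes < 2:
        return 0.0  # no link is possible with fewer than two nodes
    max_links = n_nodes * (n_nodes - 1) / 2
    return n_links / max_links


# A triangle (3 nodes, 3 links) is a complete subgraph.
triangle = density(3, 3)
# A chain of 4 nodes has 3 of the 6 possible links.
chain = density(4, 3)
print(triangle, chain)
```

A value of 1 corresponds to a complete subgraph, which is the pattern described for the small components.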

# DELIVER (individually)

Deliver a zip file containing:

* This notebook
* The ``diseasome.csv`` and ``diseasome.png`` files
* The ``disease-disease.csv`` and ``disease-disease.png`` files

<font size="+2" color="#003300">I hereby declare that, except for the code provided by the course instructors, all of my code, report, and figures were produced by myself.</font>
