## RARE-X layer: Monoplex network of disease - patient - symptoms associations

This code extracts disease - patient and patient - symptoms associations from RARE-X anonymized data (file `data/Survey_Symptoms_US.tsv`). The network contains three types of nodes: disease, patient and symptom nodes, and two types of edges: disease-patient and patient-symptom edges. The edges are weighted:
* Each connection between diseases and patients carries a uniform weight of 1.
* Relationships between patients and CSHQ symptoms are assigned weights based on the normalized CSHQ score.
* When dealing with binary patient-symptom links (presence or absence), only symptoms marked as present are considered, each assigned a weight of 1.

The code generates the following monoplex network: ```network/multiplex/RARE_X/RARE_X_layer_weighted.tsv```
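
As a minimal sketch of the weighting scheme described above, the toy example below builds the three kinds of weighted edges for one made-up patient. The disease, patient identifier, symptom names, and scores are hypothetical and are not taken from the RARE-X data; only the divisor of 9 for the Sleep Duration subscale mirrors the normalization used later in this notebook.

```python
# Hypothetical toy record, not taken from the RARE-X file
patient_id = "patient_1"
disease = "Disease_A"
binary_symptoms = {"Symptom_A": 1, "Symptom_B": 0}
cshq_scores = {"CSHQ Subscale 3: Sleep Duration": 6}
cshq_max = {"CSHQ Subscale 3: Sleep Duration": 9}

# Disease-patient edge: uniform weight of 1
edges = [(disease, patient_id, 1.0)]

# Binary symptoms: keep only the symptoms marked as present, weight 1
edges += [(patient_id, s, 1.0) for s, present in binary_symptoms.items() if present == 1]

# CSHQ symptoms: weight is the score divided by the subscale maximum
edges += [(patient_id, s, score / cshq_max[s]) for s, score in cshq_scores.items()]

for source, target, weight in edges:
    print(source, target, round(weight, 3), sep="\t")
```

The absent symptom (`Symptom_B`) produces no edge, so the toy patient ends up with one disease edge and two symptom edges.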


### 1. Load data

In [1]:
# import modules
import pandas as pd
import numpy as np

In [2]:
# read the data file
df = pd.read_csv('../data/Survey_Symptoms_US.tsv',sep='\t', header=0, index_col=0)
df.head()

# compute number of diseases, patients, symptoms and scores
nb_diseases = (len(df["Disease Name"].unique()))
print(f"{nb_diseases} diseases")
print(f"{df.shape[0]} patients")
print(f"{len(df.columns[0:214])} symptoms")
print(f"{len(df.columns[214:222])} CSHQ scores")

27 diseases
741 patients
214 symptoms
8 CSHQ scores


### 2. Store disease-patient associations in dictionary

Disease-patient associations are stored in a dictionary with the function `build_dico_diseases_patients`.

In [3]:
def build_dico_diseases_patients(df: pd.DataFrame) -> dict:
    """
    This function creates a dictionary of the diseases described in the
    RARE-X data file, and their associated patients

    Args:
        df (pd.DataFrame): dataframe containing the loaded RARE-X
        data file

    Returns:
        dict: a dictionary with the keys being the diseases described
        in the data file and the values being the patients associated
        with each disease
    """
    dico_diseases_patients = dict()
    for index, row in df.iterrows():
        # Use the string form of the name for both the lookup and the key
        disease = str(row["Disease Name"])
        if disease not in dico_diseases_patients:
            dico_diseases_patients[disease] = [index]
        else:
            dico_diseases_patients[disease].append(index)
    return dico_diseases_patients

dico_diseases_patients = build_dico_diseases_patients(df=df)
dico_diseases_patients

{'Kleefstra syndrome': [0,
  20,
  24,
  26,
  27,
  34,
  35,
  122,
  184,
  371,
  375,
  391,
  393,
  397,
  403,
  410,
  424,
  452,
  456,
  458,
  464,
  465,
  469,
  480,
  483,
  489,
  568,
  571,
  573,
  586,
  588,
  589,
  624,
  631,
  632,
  633,
  638,
  643,
  650,
  653,
  657,
  664,
  668,
  671,
  674,
  689,
  690,
  691,
  692,
  694,
  695,
  725,
  726,
  727,
  731,
  739],
 'Koolen-de Vries Syndrome': [1,
  2,
  22,
  23,
  30,
  32,
  33,
  54,
  151,
  193,
  199,
  268,
  322,
  347,
  507,
  585],
 'CHD2 related disorders': [3,
  43,
  70,
  84,
  181,
  230,
  293,
  295,
  386,
  398,
  413,
  439,
  466,
  492,
  554,
  583,
  584,
  595,
  597,
  600,
  611,
  612,
  613,
  616,
  617,
  619,
  621,
  630,
  635,
  641,
  642,
  651,
  652,
  654,
  660,
  672,
  673,
  677,
  678,
  679,
  680,
  681,
  682,
  684,
  685,
  729,
  732],
 'FOXP1 Syndrome': [4,
  8,
  9,
  14,
  15,
  16,
  31,
  49,
  50,
  52,
  53,
  59,
  60,
  69,
  74,
  81,
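
The same grouping can also be sketched with a pandas `groupby`, shown here on a hypothetical toy table rather than the RARE-X file. The names `toy_df` and `dico_toy` are illustrative only; this is an alternative formulation under those assumptions, not the function used in this notebook.

```python
import pandas as pd

# Hypothetical toy table with the same layout as the RARE-X file:
# patients as the index and one "Disease Name" column
toy_df = pd.DataFrame(
    {"Disease Name": ["Disease A", "Disease B", "Disease A"]},
    index=[0, 1, 2],
)

# Collect the patient index of each disease group
dico_toy = {}
for disease, group in toy_df.groupby("Disease Name", sort=False):
    dico_toy[str(disease)] = group.index.tolist()

print(dico_toy)
```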


### 3. Create mapping functions to map the Rare-X disease names to Orphanet names

The Rare-X diseases need to be mapped to the Orphanet names. This mapping has been done manually. Note that only 15 of the 27 Rare-X diseases can be mapped to OrphaCodes.

The original Rare-X disease - Orphanet mapping from which we extract the correspondences can be found in file `data/Diseases_Rx_orpha_corres.csv`.

The mapping is then stored as a dictionary by using the function `create_mapping_file_diseases_names`.

In [4]:
def create_mapping_file_diseases_names(mapping_file: str) -> dict:
    """This function creates a dictionary used for the mapping
    of disease names in the RARE-X data file to ORPHANET names

    Args:
        mapping_file (str): the file containing the mapping of
        disease identifiers

    Returns:
        dict: a dictionary with keys being the diseases described in
        the RARE-X data file and the values being their mapping
        according to ORPHANET. If there is no corresponding
        ORPHANET identifier, there is no mapping and the value
        in the dictionary is set to "None"
    """
    dico_mapping = dict()
    mapping = pd.read_csv(mapping_file, sep=";", header=0)
    for index, row in mapping.iterrows():
        # First column: RARE-X name, second column: ORPHANET name
        dico_mapping[str(row.iloc[0])] = row.iloc[1]
    return dico_mapping

dico_mapping = create_mapping_file_diseases_names(mapping_file="../data/Diseases_Rx_orpha_corres.csv")
for disease_rarex, disease_orpha in dico_mapping.items():
    print(f"{disease_rarex} : {disease_orpha} \n")

4H Leukodystrophy : 4H leukodystrophy 

8p-related disorders : 8p inverted duplication/deletion syndrome 

AHC (Alternating Hemiplegia of Childhood) : Alternating hemiplegia of childhood 

ARHGEF9-related disorders : None 

CACNA1A related disorders : None 

CASK-Related Disorders : None 

CHAMP1 related disorders : None 

CHD2 related disorders : None 

CHOPS Syndrome : Cognitive impairment-coarse facies-heart defects-obesity-pulmonary involvement-short stature-skeletal dysplasia syndrome 

Classic homocystinuria : Classic homocystinuria 

DYRK1A Syndrome : DYRK1A-related intellectual disability syndrome 

FAM177A1 Associated Disorder : None 

FOXP1 Syndrome : Intellectual disability-severe speech delay-mild dysmorphism syndrome 

HUWE1-related disorders : None 

KDM5C-related disorders : KDM5C-related syndromic X-linked intellectual disability 

Kleefstra syndrome : Kleefstra syndrome 

Koolen-de Vries Syndrome : Koolen-De Vries syndrome 

Malan Syndrome : Malan overgrowth syndrome 


We also define a `map_disease_name` function to map Rare-X disease names to Orphanet names.

In [5]:
def map_disease_name(name_to_map: str, mapping_dict: dict) -> str:
    """This function maps a disease name from the RARE-X data file
    to its corresponding ORPHANET name from the mapping dictionary

    Args:
        name_to_map (str): the name of the disease to map
        mapping_dict (dict): the mapping dictionary

    Returns:
        str: the ORPHANET name of the disease if it exists,
        otherwise the original name
    """
    mapped_name = mapping_dict[name_to_map]
    # Diseases without an ORPHANET name have the value "None"
    return mapped_name if mapped_name != "None" else name_to_map
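
As a short usage sketch, the snippet below applies the same fallback rule to a small hand-written dictionary. The function is restated so the snippet runs on its own, and `toy_mapping` is a hypothetical two-entry excerpt whose values are copied from the mapping printed above.

```python
def map_disease_name(name_to_map: str, mapping_dict: dict) -> str:
    # Same rule as above: fall back to the RARE-X name when the value is "None"
    mapped_name = mapping_dict[name_to_map]
    return mapped_name if mapped_name != "None" else name_to_map

# Hypothetical excerpt of the mapping dictionary
toy_mapping = {
    "FOXP1 Syndrome": "Intellectual disability-severe speech delay-mild dysmorphism syndrome",
    "CHD2 related disorders": "None",
}

mapped = map_disease_name("FOXP1 Syndrome", toy_mapping)
unmapped = map_disease_name("CHD2 related disorders", toy_mapping)
print(mapped)
print(unmapped)
```

A name that is missing from the dictionary altogether raises a `KeyError`, so every disease in the data file is expected to have a row in the mapping file.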

### 4. Build the Patient-Disease network

Now that we have:
* Patient - (Rare-X) Diseases associations in variable `dico_diseases_patients`,
* Rare-X Diseases - Orphanet Diseases correspondence in variable `dico_mapping`,

we can construct the Patient-Disease network using the function `build_disease_patient_network`. The output is a pandas dataframe containing all Patient-Disease associations.

Note that for Rare-X diseases having a mapping in Orphanet, we use the Orphanet name of the disease instead of the Rare-X name.

In [6]:
def build_disease_patient_network(dico_diseases_patients: dict, mapping_dict: dict) -> pd.DataFrame:
    """This function builds the network of disease-patient associations.

    Args:
        dico_diseases_patients (dict): the dictionary of disease-patient associations
        mapping_dict (dict): the mapping dictionary

    Returns:
        pd.DataFrame: one row per disease-patient association, with the
        disease name in "Source" and the patient in "Target"
    """
    rows = []
    for disease, patients in dico_diseases_patients.items():
        disease_orpha_name = map_disease_name(name_to_map=disease, mapping_dict=mapping_dict)
        for patient in patients:
            rows.append((disease_orpha_name, patient))
    # Build the dataframe once instead of setting values cell by cell
    return pd.DataFrame(rows, columns=["Source", "Target"])

disease_patient = build_disease_patient_network(dico_diseases_patients=dico_diseases_patients, 
                                                mapping_dict=dico_mapping)
disease_patient

Unnamed: 0,Source,Target
0,Kleefstra syndrome,0
1,Kleefstra syndrome,20
2,Kleefstra syndrome,24
3,Kleefstra syndrome,26
4,Kleefstra syndrome,27
...,...,...
736,ARHGEF9-related disorders,508
737,ARHGEF9-related disorders,511
738,ARHGEF9-related disorders,517
739,ARHGEF9-related disorders,522


We add weights to the disease-patient associations, and set them all to `1`.

In [7]:
disease_patient['Weight'] = 1
disease_patient

Unnamed: 0,Source,Target,Weight
0,Kleefstra syndrome,0,1
1,Kleefstra syndrome,20,1
2,Kleefstra syndrome,24,1
3,Kleefstra syndrome,26,1
4,Kleefstra syndrome,27,1
...,...,...,...
736,ARHGEF9-related disorders,508,1
737,ARHGEF9-related disorders,511,1
738,ARHGEF9-related disorders,517,1
739,ARHGEF9-related disorders,522,1


### 5. Build the Patient-Symptom network

Next, we build the Patient-Symptom network. `NA` and missing values are treated as absence of the symptom. Patient-Symptom associations get a weight of `1` when the symptom is present, and CSHQ symptoms are weighted by their normalized CSHQ score.

First, we normalize each CSHQ score by its max value:

In [8]:
# Divisor (max value) used to normalize each CSHQ subscale
cshq_max_scores = {
    'CSHQ Subscale 1: Bedtime Resistance': 18,
    'CSHQ Subscale 2: Sleep onset Delay': 3,
    'CSHQ Subscale 3: Sleep Duration': 9,
    'CSHQ Subscale 4: Sleep Anxiety': 12,
    'CSHQ Subscale 5: Night Wakings': 9,
    'CSHQ Subscale 6: Parasomnias': 21,
    'CSHQ Subscale 7: Sleep Disordered Breathing': 9,
    'CSHQ Subscale 8: Daytime Sleepiness': 24,
}
for column, max_score in cshq_max_scores.items():
    df[column] = df[column] / max_score
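
To illustrate what this normalization does, here is a minimal sketch on a hypothetical two-column table. The score values are made up; the divisors 3 and 9 are the ones used above for these two subscales. Missing scores stay missing after the division.

```python
import pandas as pd

# Hypothetical raw scores for three patients (None marks a missing answer)
toy_scores = pd.DataFrame(
    {
        "CSHQ Subscale 2: Sleep onset Delay": [1, 3, None],
        "CSHQ Subscale 5: Night Wakings": [3, 9, 6],
    }
)

# Divisors taken from the normalization cell above
toy_max = pd.Series(
    {
        "CSHQ Subscale 2: Sleep onset Delay": 3,
        "CSHQ Subscale 5: Night Wakings": 9,
    }
)

# Column-wise division: each column is divided by its own divisor
toy_normalized = toy_scores.div(toy_max, axis="columns")
print(toy_normalized)
```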

Now, we can parse the data frame and extract weighted Patient-Symptoms associations. For boolean symptoms (present/absent), we only extract symptoms that are present, and set the weight to 1. For CSHQ symptoms, we use the normalized CSHQ score as a weight. 

In [9]:
# Dropping the disease column
symp_df = df.drop(df.columns[222], axis=1)

# Initialize an empty list to store the results
result_rows = []

for idx, row in symp_df.iterrows():
    for column_name, value in row.items():
        if not np.isnan(value) and value!=0:
            result_rows.append([idx, column_name, value])

# Resulting weighted Patient-Symptom network
patient_symptom = pd.DataFrame(result_rows, columns=['Source', 'Target', 'Weight'])
patient_symptom

Unnamed: 0,Source,Target,Weight
0,0,Tall_Stature_Symptom_Present,1.0
1,0,Heart_Defect_Symptom_Present,1.0
2,0,Decreased_Sweating_Symptom_Present,1.0
3,0,Cognitive_Impairment_Symptom_Present,1.0
4,0,Coordination_Issues_Symptom_Present,1.0
...,...,...,...
10387,740,Constipation_Symptom_Present,1.0
10388,740,Esophagus_Issues_Symptom_Present,1.0
10389,740,Coordination_Issues_Symptom_Present,1.0
10390,740,Hypotonia_Symptom_Present,1.0
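
The same extraction can be sketched without an explicit double loop by reshaping the table to long format with `melt` and filtering out missing and zero values. The table `toy_symp`, its column names, and its index name `patient` are hypothetical; this is an alternative formulation, not the code that produced the output above.

```python
import numpy as np
import pandas as pd

# Hypothetical patient x symptom table: 1 = present, 0 = absent, NaN = missing,
# plus one column holding an already normalized score
toy_symp = pd.DataFrame(
    {
        "Symptom_A": [1.0, 0.0, np.nan],
        "Symptom_B": [np.nan, 1.0, 1.0],
        "Score_C": [0.5, np.nan, 0.25],
    },
    index=pd.Index([10, 11, 12], name="patient"),
)

# One row per (patient, column) pair
long_format = toy_symp.reset_index().melt(
    id_vars="patient", var_name="Target", value_name="Weight"
)

# Keep only values that are neither missing nor zero
kept = long_format[long_format["Weight"].notna() & (long_format["Weight"] != 0)]

# A stable sort by patient restores the patient-by-patient row order
toy_edges = (
    kept.rename(columns={"patient": "Source"})
    .sort_values("Source", kind="stable")
    .reset_index(drop=True)
)
print(toy_edges)
```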


### 6. Concatenate the Patient-Disease network and the Patient-Symptom network

The final Rare-X network is created by concatenating the Patient-Disease and Patient-Symptoms associations:

In [10]:
rarex_network = pd.concat([disease_patient, patient_symptom], ignore_index=True)
rarex_network

Unnamed: 0,Source,Target,Weight
0,Kleefstra syndrome,0,1.0
1,Kleefstra syndrome,20,1.0
2,Kleefstra syndrome,24,1.0
3,Kleefstra syndrome,26,1.0
4,Kleefstra syndrome,27,1.0
...,...,...,...
11128,740,Constipation_Symptom_Present,1.0
11129,740,Esophagus_Issues_Symptom_Present,1.0
11130,740,Coordination_Issues_Symptom_Present,1.0
11131,740,Hypotonia_Symptom_Present,1.0
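
The sketch below, on two hypothetical one-row tables, shows what the concatenation does to the column types: the integer weights of the disease-patient edges become floats once they share a column with the CSHQ-based weights, and `Source` and `Target` hold a mix of disease names, patient identifiers, and symptom names.

```python
import pandas as pd

# Hypothetical one-row edge lists
toy_disease_patient = pd.DataFrame({"Source": ["Disease A"], "Target": [0], "Weight": [1]})
toy_patient_symptom = pd.DataFrame({"Source": [0], "Target": ["Symptom_A"], "Weight": [0.5]})

toy_network = pd.concat([toy_disease_patient, toy_patient_symptom], ignore_index=True)
print(toy_network)
print(toy_network.dtypes)
```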


We save this network in file ```network/multiplex/RARE_X/RARE_X_layer_weighted.tsv```

In [11]:
rarex_network.to_csv("../network/multiplex/RARE_X/RARE_X_layer_weighted.tsv", sep="\t", header=None, index=False)
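
Because the file is written without a header row, column names have to be supplied when reading it back. The round trip below is a minimal sketch on a hypothetical two-edge network written to a temporary directory; the file name `toy_layer.tsv` is illustrative. Reading `Source` and `Target` as strings is an assumption made here so that patient identifiers and disease or symptom names share one node-label type.

```python
import os
import tempfile

import pandas as pd

# Hypothetical two-edge network with the same three columns
toy_network = pd.DataFrame(
    {"Source": ["Disease A", 0], "Target": [0, "Symptom_A"], "Weight": [1.0, 1.0]}
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_layer.tsv")
    # Write a headerless, tab-separated edge list
    toy_network.to_csv(path, sep="\t", header=False, index=False)
    # Read it back, naming the columns explicitly
    reloaded = pd.read_csv(
        path,
        sep="\t",
        header=None,
        names=["Source", "Target", "Weight"],
        dtype={"Source": str, "Target": str},
    )

print(reloaded)
```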