## RARE-X layer: Monoplex network of disease - patient - symptoms associations

This code extracts disease-patient-symptom associations from RARE-X anonymized patient data (file `Survey_Symptoms_US.tsv`). The network contains three types of nodes (disease, patient and symptom) and two types of edges (disease-patient and patient-symptom). For the sake of simplicity, quantitative symptoms (i.e., CSHQ scores) are not used to build the network. The code generates the following monoplex network: `network/multiplex/RARE_X/RARE_X_layer.tsv`


### 1. Load data

In [1]:
# import modules
import pandas as pd

In [2]:
# read the data file
df = pd.read_csv('../data/Survey_Symptoms_US.tsv',sep='\t', header=0, index_col=0)
df.head()

# compute number of diseases, patients, symptoms and scores
nb_diseases = (len(df["Disease Name"].unique()))
print(f"{nb_diseases} diseases")
print(f"{df.shape[0]} patients")
print(f"{len(df.columns[0:214])} symptoms")
print(f"{len(df.columns[214:222])} CSHQ scores")

27 diseases
741 patients
214 symptoms
8 CSHQ scores
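
The positional slices used above (`df.columns[0:214]` for symptoms, `df.columns[214:222]` for CSHQ scores) can be illustrated on a toy table. The sketch below is only a miniature under stated assumptions: the column names, the values and the `n_symptoms` boundary are invented for illustration and do not come from the RARE-X file. It also shows one possible way to check that the symptom block only holds binary answers.

```python
import pandas as pd

# Hypothetical miniature of the survey table: 3 symptom columns followed by
# 2 score columns. All names and values are made up for illustration.
toy = pd.DataFrame(
    {
        "Symptom A": [1.0, 0.0, None],
        "Symptom B": [0.0, 1.0, 1.0],
        "Symptom C": [1.0, None, 0.0],
        "Score 1": [41.0, 52.0, 38.0],
        "Score 2": [7.0, 9.0, 5.0],
    },
    index=["P1", "P2", "P3"],
)

n_symptoms = 3  # assumed boundary between symptom and score columns
symptom_cols = toy.columns[:n_symptoms]
score_cols = toy.columns[n_symptoms:]

# Collect the distinct non-missing values found in the symptom block
values = pd.Series(toy[symptom_cols].to_numpy().ravel()).dropna()
observed = set(values.unique())

# Binary symptom columns should only hold 0 or 1 (missing answers aside)
is_binary = observed <= {0.0, 1.0}

print(list(symptom_cols))
print(list(score_cols))
print(is_binary)
```

Slicing by position is compact but silently breaks if the column order of the file changes, so a check of this kind can be worth running after loading.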


### 2. Store disease-patient and patient-symptom associations in dictionaries

Disease-patient and patient-symptom associations are stored in dictionaries. CSHQ features are quantitative values. Because the RARE-X layer is unweighted, these features are removed.

In [3]:
def build_dico_diseases_patients(df: pd.DataFrame) -> dict:
    """
    This function creates a dictionary of the diseases described in the 
    RARE-X data file, and their associated patients

    Args:
        df (pd.DataFrame): dataframe containing the loaded RARE-X
        data file
    
    Returns:
        dict: a dictionary with the keys being the diseases described
        in the data file and the values being the patients associated
        with each disease
    """
    # CSHQ score columns do not need to be dropped here: only the index
    # (patients) and the "Disease Name" column are read
    dico_diseases_patients = dict()
    for index, row in df.iterrows():
        disease = str(row["Disease Name"])
        # add the patient to the list of its disease, creating the list if needed
        dico_diseases_patients.setdefault(disease, []).append(index)
    return dico_diseases_patients

dico_diseases_patients = build_dico_diseases_patients(df=df)

In [4]:
def build_dico_patients_symptoms(df: pd.DataFrame) -> dict:
    """This function creates a dictionary of patients and
    associated symptoms from the RARE-X data file

    Args:
        df (pd.DataFrame): the dataframe containing the 
        loaded RARE-X data

    Returns:
        dict: a dictionary with the keys being the patients 
        described in the data file and the values being 
        their associated symptoms
    """
    # remove CSHQ scores columns
    df = df.drop(df.columns[214:222], axis=1)
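    # for each patient (row), list the columns whose value is 1.0;
    # computed once for the whole table instead of once per row
    symptoms = df.apply(lambda row: list(row.index[row == 1.0]), axis=1)
    dico_patients_symptoms = dict()
    # Series.items() replaces iteritems(), which was removed in pandas 2.0
    for patient, patient_symptoms in symptoms.items():
        dico_patients_symptoms[patient] = patient_symptoms
    return dico_patients_symptoms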

dico_patients_symptoms = build_dico_patients_symptoms(df=df)
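
To make the shape of the two dictionaries concrete, the following sketch rebuilds them on a toy table. The table, its column names and the `toy_*` variables are invented for illustration. The disease-patient dictionary is built with `groupby` here, which is an alternative to the explicit loop above rather than the notebook's own implementation.

```python
import pandas as pd

# Hypothetical toy table: patient identifiers as index, one disease column
# and binary symptom columns (names and values invented for illustration).
toy = pd.DataFrame(
    {
        "Disease Name": ["Disease X", "Disease X", "Disease Y"],
        "Symptom A": [1.0, 0.0, None],
        "Symptom B": [0.0, 1.0, 1.0],
    },
    index=["P1", "P2", "P3"],
)

# disease -> list of patients
toy_diseases_patients = {
    str(disease): list(group.index)
    for disease, group in toy.groupby("Disease Name")
}

# patient -> list of symptoms reported as present (value 1.0);
# absent (0.0) and missing (NaN) answers are ignored
symptom_table = toy.drop(columns="Disease Name")
toy_patients_symptoms = {
    patient: list(row.index[row == 1.0])
    for patient, row in symptom_table.iterrows()
}

print(toy_diseases_patients)
print(toy_patients_symptoms)
```

Only symptoms answered with 1.0 produce an entry, so a missing answer and a negative answer are indistinguishable in the resulting network.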

### 3. Create mapping functions to map the disease names to ORPHANET names

The RARE-X disease names need to be mapped to Orphanet names. This mapping was done manually and is loaded into a dictionary. Note that only 15 of the 27 RARE-X diseases can be mapped to OrphaCodes.

In [8]:
def create_mapping_file_diseases_names(mapping_file: str) -> dict:
    """This function creates a dictionary used for the mapping
    of disease names in the RARE-X data file to ORPHANET names

    Args:
        mapping_file (str): the file containing the mapping of
        disease identifiers

    Returns:
        dict: a dictionary with keys being the diseases described in 
        the RARE-X data file and the values being their mapping 
        according to ORPHANET. If there is no corresponding 
        ORPHANET identifier, there is no mapping and the value
        in the dictionary is set to "None"
    """
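    dico_mapping = dict()
    mapping = pd.read_csv(mapping_file, sep=";", header=0)
    for index, row in mapping.iterrows():
        # first column: RARE-X name, second column: ORPHANET name;
        # .iloc makes the positional access explicit
        dico_mapping[str(row.iloc[0])] = row.iloc[1]
    return dico_mapping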

dico_mapping = create_mapping_file_diseases_names(mapping_file="../data/Diseases_Rx_orpha_corres.csv")
for disease_rarex, disease_orpha in dico_mapping.items():
    print(f"{disease_rarex} : {disease_orpha} \n")

4H Leukodystrophy : 4H leukodystrophy 

8p-related disorders : 8p inverted duplication/deletion syndrome 

AHC (Alternating Hemiplegia of Childhood) : Alternating hemiplegia of childhood 

ARHGEF9-related disorders : None 

CACNA1A related disorders : None 

CASK-Related Disorders : None 

CHAMP1 related disorders : None 

CHD2 related disorders : None 

CHOPS Syndrome : Cognitive impairment-coarse facies-heart defects-obesity-pulmonary involvement-short stature-skeletal dysplasia syndrome 

Classic homocystinuria : Classic homocystinuria 

DYRK1A Syndrome : DYRK1A-related intellectual disability syndrome 

FAM177A1 Associated Disorder : None 

FOXP1 Syndrome : Intellectual disability-severe speech delay-mild dysmorphism syndrome 

HUWE1-related disorders : None 

KDM5C-related disorders : KDM5C-related syndromic X-linked intellectual disability 

Kleefstra syndrome : Kleefstra syndrome 

Koolen-de Vries Syndrome : Koolen-De Vries syndrome 

Malan Syndrome : Malan overgrowth syndrome 


In [32]:
def map_disease_name(name_to_map: str, mapping_dict: dict) -> str:
    """This function maps a disease name from the RARE-X data file
    to its corresponding ORPHANET name from the mapping dictionary

    Args:
        name_to_map (str): the name of the disease to map
        mapping_dict (dict): the mapping dictionary

    Returns:
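        str: the ORPHANET name of the disease if it exists,
        otherwise the original name
    """
    orpha_name = mapping_dict[name_to_map]
    # unmapped diseases are stored as "None"; pd.isna also covers the case
    # where pandas parsed that value as a missing value
    if pd.isna(orpha_name) or orpha_name == "None":
        return name_to_map
    return orpha_name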
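
As a quick illustration of the fallback behaviour, the sketch below applies the same rule to a two-entry excerpt of the mapping. The entries are taken from the output printed above. `toy_mapping` and `map_or_keep` are hypothetical names introduced here so that the snippet runs on its own, and `map_or_keep` is a stand-in for `map_disease_name`, not a replacement.

```python
# Hypothetical excerpt of the mapping dictionary; the two entries are taken
# from the output printed above.
toy_mapping = {
    "Malan Syndrome": "Malan overgrowth syndrome",
    "CHD2 related disorders": "None",
}


def map_or_keep(name: str, mapping: dict) -> str:
    """Stand-in for map_disease_name: return the Orphanet name, or the
    original name when the mapping value is the string "None"."""
    mapped = mapping[name]
    return mapped if mapped != "None" else name


# Name used in the network for each RARE-X disease
mapped_names = {name: map_or_keep(name, toy_mapping) for name in toy_mapping}

# RARE-X diseases that keep their original name
unmapped = [name for name, orpha in toy_mapping.items() if orpha == "None"]

print(mapped_names)
print(unmapped)
```

Keeping the original name for unmapped diseases means they still appear in the layer, but they cannot be linked to Orphanet-based nodes elsewhere.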

### 4. Build the network

#### a. Without the NA values

In [35]:
def build_rarex_network_without_na(dico_diseases_patients: dict, dico_patients_symptoms: dict, mapping_dict: dict, network_filename: str) -> None:
    """This function builds the network of disease-patient-symptom associations
    and writes it to a tab-separated file.

    Args:
        dico_diseases_patients (dict): the dictionary of disease-patient associations
        dico_patients_symptoms (dict): the dictionary of patient-symptom associations
        mapping_dict (dict): the mapping dictionary
        network_filename (str): the path to the generated network file
    
    Returns:
        None
    """
    edges = []
    # disease-patient edges, using the ORPHANET name when available
    for disease, patients in dico_diseases_patients.items():
        disease_orpha_name = map_disease_name(name_to_map=disease, mapping_dict=mapping_dict)
        for patient in patients:
            edges.append((disease_orpha_name, patient))
    # patient-symptom edges
    for patient, symptoms in dico_patients_symptoms.items():
        for symptom in symptoms:
            edges.append((patient, symptom))
    # write the edge list without header or index
    network = pd.DataFrame(edges, columns=["source", "target"])
    network.to_csv(network_filename, sep="\t", header=False, index=False)

build_rarex_network_without_na(
                               dico_diseases_patients=dico_diseases_patients, 
                               dico_patients_symptoms=dico_patients_symptoms, 
                               mapping_dict=dico_mapping, 
                               network_filename="../network/multiplex/RARE_X/RARE_X_layer.tsv"
                               )
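
The generated file is a header-less, tab-separated edge list with one `source` and one `target` per line. A rough sanity check of such a file could look like the sketch below. It reads a toy edge list from memory instead of the real file, and all node names in it are invented for illustration.

```python
import io

import pandas as pd

# Hypothetical edge list in the same two-column, tab-separated, header-less
# layout as the generated file (node names invented for illustration).
toy_layer = io.StringIO(
    "Disease X\tP1\n"
    "Disease X\tP2\n"
    "P1\tSymptom A\n"
    "P2\tSymptom B\n"
)

edges = pd.read_csv(toy_layer, sep="\t", header=None, names=["source", "target"])

sources = set(edges["source"])
targets = set(edges["target"])

# In this layout, patients are the only nodes that appear both as a target
# (of a disease) and as a source (of a symptom).
patients = sources & targets
diseases = sources - targets
symptoms = targets - sources

print(len(edges), "edges")
print(sorted(diseases), sorted(patients), sorted(symptoms))
```

This classification is only approximate: a patient without any reported symptom appears solely as a target and would be counted with the symptoms.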