# Example solution and writeup for RARE-X Task 1
## Jake Albrecht, June 2023
This is a Jupyter notebook that can be used to document and run code for Task 1 of the [RARE-X OSDC](https://www.synapse.org/rarex).

In [1]:
import pandas as pd
import numpy as np
import json 
import xml.etree.ElementTree as ET

# Step 1: Read data

In [2]:
df=pd.read_csv('Survey_Symptoms_US.tsv',sep='\t')

In [3]:
# Normalize each CSHQ subscale score by its maximum value

df['CSHQ Subscale 1: Bedtime Resistance']= (df['CSHQ Subscale 1: Bedtime Resistance'])/(18)
df['CSHQ Subscale 2: Sleep onset Delay']= (df['CSHQ Subscale 2: Sleep onset Delay'])/(3)
df['CSHQ Subscale 3: Sleep Duration']= (df['CSHQ Subscale 3: Sleep Duration'])/(9)
df['CSHQ Subscale 4: Sleep Anxiety']= (df['CSHQ Subscale 4: Sleep Anxiety'])/(12)
df['CSHQ Subscale 5: Night Wakings']= (df['CSHQ Subscale 5: Night Wakings'])/(9)
df['CSHQ Subscale 6: Parasomnias']= (df['CSHQ Subscale 6: Parasomnias'])/(21)
df['CSHQ Subscale 7: Sleep Disordered Breathing']= (df['CSHQ Subscale 7: Sleep Disordered Breathing'])/(9)
df['CSHQ Subscale 8: Daytime Sleepiness']= (df['CSHQ Subscale 8: Daytime Sleepiness'])/(24)
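
The eight assignments above can also be written as one loop over a dictionary of maximum scores. The sketch below is an equivalent alternative to that cell, not something to run in addition to it (that would divide twice). It uses a two-row toy frame instead of the survey file; `cshq_max` and `toy` are illustrative names, and the maximum scores are simply the divisors used above.

```python
import pandas as pd

# Maximum score per CSHQ subscale, copied from the divisors used in the cell above.
cshq_max = {
    'CSHQ Subscale 1: Bedtime Resistance': 18,
    'CSHQ Subscale 2: Sleep onset Delay': 3,
    'CSHQ Subscale 3: Sleep Duration': 9,
    'CSHQ Subscale 4: Sleep Anxiety': 12,
    'CSHQ Subscale 5: Night Wakings': 9,
    'CSHQ Subscale 6: Parasomnias': 21,
    'CSHQ Subscale 7: Sleep Disordered Breathing': 9,
    'CSHQ Subscale 8: Daytime Sleepiness': 24,
}

# Toy data (hypothetical values): row 0 holds the maximum, row 1 holds half of it.
toy = pd.DataFrame({col: [m, m / 2] for col, m in cshq_max.items()})

# Scale every subscale to the 0-1 range.
for col, max_score in cshq_max.items():
    toy[col] = toy[col] / max_score
```

Keeping the maxima in one dictionary makes it easier to check them against the questionnaire documentation in one place.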

Inspect the input data:


In [4]:
df.head()

Unnamed: 0,Asthma_Symptom_Present,COPD_Symptom_Present,Respitory_Abnormality_Symptom_Present,Decreased_Pulmonary_Function_Symptom_Present,Abnormal_Diaphragm_Symptom_Present,Respitory_Insufficiency_Symptom_Present,Restrictive_Lung_Disease_Symptom_Present,Abnormal_Breathing_Patterns_Symptom_Present,Abnormal_Upper_Respiratory_Symptom_Present,Abnormal_Eye_Movement_Symptom_Present,...,PELHS Wheelchair,CSHQ Subscale 1: Bedtime Resistance,CSHQ Subscale 2: Sleep onset Delay,CSHQ Subscale 3: Sleep Duration,CSHQ Subscale 4: Sleep Anxiety,CSHQ Subscale 5: Night Wakings,CSHQ Subscale 6: Parasomnias,CSHQ Subscale 7: Sleep Disordered Breathing,CSHQ Subscale 8: Daytime Sleepiness,Disease_Name
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,,,,,,,,,,Kleefstra syndrome
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,...,,,,,,,,,,Koolen-de Vries Syndrome
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,...,0.0,0.444444,0.333333,0.333333,0.416667,0.444444,0.380952,0.444444,0.416667,Koolen-de Vries Syndrome
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.333333,0.333333,0.333333,0.416667,0.333333,0.333333,0.333333,0.375,CHD2 related disorders
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,0.0,...,0.0,0.611111,0.333333,0.666667,0.666667,0.444444,0.52381,0.444444,0.333333,FOXP1 Syndrome


Look at disease names:

In [5]:
df.Disease_Name.unique()


array(['Kleefstra syndrome', 'Koolen-de Vries Syndrome',
       'CHD2 related disorders', 'FOXP1 Syndrome',
       'Classic homocystinuria', 'DYRK1A Syndrome',
       'CHAMP1 related disorders', 'STXBP1 related Disorders',
       'Pallister-Killian mosaic syndrome (PKS)',
       'CACNA1A related disorders', 'Ring14 and related disorders',
       'Wiedemann-Steiner Syndrome (WSS)', 'SYNGAP1 related disorders',
       'CASK-Related Disorders', 'NARS1 genetic mutation',
       'FAM177A1 Associated Disorder', 'Malan Syndrome',
       'SETBP1-related disorder', 'CHOPS Syndrome',
       'HUWE1-related disorders', '4H Leukodystrophy',
       '8p-related disorders', 'Ogden Syndrome (NAA10)',
       'AHC (Alternating Hemiplegia of Childhood)',
       'MSL3 -related disorders', 'KDM5C-related disorders',
       'ARHGEF9-related disorders'], dtype=object)

## Get the average response to the questions, ignore the missing data

Impose a minimum number of non-missing responses per disease and question. This threshold is an optional choice.

In [6]:
def mean_if_enough_cases(x, n=5):
    if x.count()>=n:
        return np.mean(x)
    else:
        return np.nan

freq_table = df.groupby('Disease_Name').agg(mean_if_enough_cases)
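
To see how the minimum-case rule behaves, here is a small sketch on made-up data. The toy frame, its column names, and the values are assumptions for illustration only. The function is restated so the example is self-contained, using `x.mean()`, which skips missing values by default.

```python
import numpy as np
import pandas as pd

def mean_if_enough_cases(x, n=5):
    # x.count() ignores NaN, so only answered questions count toward n.
    if x.count() >= n:
        return x.mean()  # Series.mean skips NaN by default
    return np.nan

# Toy data: disease A has 5 answers (plus one missing), disease B has only 3.
toy = pd.DataFrame({
    'Disease_Name': ['A'] * 6 + ['B'] * 3,
    'Symptom_X': [1, 1, 1, 0, 1, np.nan, 1, 1, 1],
})

toy_freq = toy.groupby('Disease_Name').agg(mean_if_enough_cases)
```

Disease A keeps its mean over the five answered rows, while disease B falls below the threshold and becomes `NaN` even though all three of its answers are positive.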

In [7]:
freq_table.loc[freq_table.isna().all(axis=1)].index

Index(['ARHGEF9-related disorders', 'CHOPS Syndrome',
       'FAM177A1 Associated Disorder', 'NARS1 genetic mutation'],
      dtype='object', name='Disease_Name')

In [8]:
rx_frequency = freq_table.melt(ignore_index=False,var_name='Symptom',value_name='Frequency')
rx_very_frequent = rx_frequency.loc[rx_frequency['Frequency'] >= 0.8]
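
The reshaping step can be hard to picture, so here is a minimal sketch on a two-disease toy table (all names and values are made up). It assumes a pandas version that supports `ignore_index` in `melt`, as the cell above does.

```python
import numpy as np
import pandas as pd

# Wide table: one row per disease, one column per symptom (toy values).
toy_freq = pd.DataFrame(
    {'Symptom_X': [0.9, 0.5], 'Symptom_Y': [np.nan, 0.85]},
    index=pd.Index(['A', 'B'], name='Disease_Name'),
)

# Long table: one row per (disease, symptom) pair, keeping the disease as the index.
toy_long = toy_freq.melt(ignore_index=False, var_name='Symptom', value_name='Frequency')

# NaN compares as False, so diseases dropped by the minimum-case rule are filtered out here.
toy_very_frequent = toy_long.loc[toy_long['Frequency'] >= 0.8]
```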

In [9]:
rx_very_frequent.sort_values('Frequency',ascending=False).head(15)

Disease_Name,Symptom,Frequency
8p-related disorders,Abnormal_Muscle_Function_Symptom_Present,1.0
Ogden Syndrome (NAA10),Cognitive_Impairment_Symptom_Present,1.0
CHAMP1 related disorders,PELHS Autism/PDD,1.0
Koolen-de Vries Syndrome,Seizures_Symptom_Present,1.0
SETBP1-related disorder,Hypotonia_Symptom_Present,1.0
Pallister-Killian mosaic syndrome (PKS),Hypotonia_Symptom_Present,1.0
MSL3 -related disorders,Hypotonia_Symptom_Present,1.0
Koolen-de Vries Syndrome,Hypotonia_Symptom_Present,1.0
CHAMP1 related disorders,Hypotonia_Symptom_Present,1.0
Pallister-Killian mosaic syndrome (PKS),Abnormal_EEG_Symptom_Present,1.0


In [10]:
rx_very_frequent['Symptom'].value_counts().head(10)

Symptom
Cognitive_Impairment_Symptom_Present        15
Hypotonia_Symptom_Present                   13
Coordination_Issues_Symptom_Present         10
PELHS Intellectual disability                4
Seizures_Symptom_Present                     4
Abnormal_EEG_Symptom_Present                 4
Abnormal_Muscle_Function_Symptom_Present     4
ASD_Symptom_Present                          2
PELHS Microcephaly (<5th percentile)         2
Short_Attention_Span_Symptom_Present         1
Name: count, dtype: int64

Save the very frequent symptoms to manually inspect:

In [11]:
rx_very_frequent.to_csv('RX_Task_1_Very_Frequent.csv')

# Step 2: Compare RARE-X Data to Orphanet

Orphanet tracks phenotypes and symptoms for various rare diseases, with the following classifications:
 - Very frequent: more than 80% 
 - Frequent: between 30% and 80% 
 - Occasional: fewer than 30% 
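
As a rough sketch, the three categories above can be expressed as a function over a 0-1 frequency. The function name `orphanet_bucket` is hypothetical, and the handling of the exact boundaries (treating 0.8 and 0.3 as belonging to the higher category) is an assumption chosen to match the `>= 0.8` filter used in Step 1.

```python
import math

def orphanet_bucket(freq):
    """Map a 0-1 frequency to one of the three category names listed above."""
    if freq is None or math.isnan(freq):
        return None  # no frequency available (e.g. too few cases)
    if freq >= 0.8:
        return 'Very frequent'
    if freq >= 0.3:
        return 'Frequent'
    return 'Occasional'

labels = [orphanet_bucket(f) for f in (1.0, 0.8, 0.5, 0.1, float('nan'))]
```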

Download the Orphanet data file `en_product4.xml` from https://www.orphadata.com/phenotypes/

In [12]:
tree = ET.parse('en_product4.xml')
root = tree.getroot()

In [13]:
# parse xml to dataframe
pf=[]
for child in root[1]:
    pf.extend([(child[0][2].text,p[0][1].text, p[1][0].text,p[0][0].text) for p in child[0][5]])
pheno_df = pd.DataFrame(pf,columns=['Disease_Name','Phenotype','Frequency','HPO Code'])
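
The positional indices above (`child[0][2]`, `p[0][1]`, and so on) depend on the exact element order in the file. A more defensive variant looks elements up by tag name. The sketch below runs on a tiny inline document; the tag names (`Disorder`, `Name`, `HPODisorderAssociation`, `HPO`, `HPOId`, `HPOTerm`, `HPOFrequency`) are assumptions about the layout of `en_product4.xml` and should be checked against the downloaded file, and the disorder in the sample is made up.

```python
import xml.etree.ElementTree as ET
import pandas as pd

# Tiny stand-in for en_product4.xml. Tag names are assumed, the disorder is fictional.
sample_xml = """
<JDBOR>
  <HPODisorderSetStatusList>
    <HPODisorderSetStatus>
      <Disorder>
        <OrphaCode>0000</OrphaCode>
        <Name lang="en">Example syndrome</Name>
        <HPODisorderAssociationList>
          <HPODisorderAssociation>
            <HPO>
              <HPOId>HP:0001252</HPOId>
              <HPOTerm>Hypotonia</HPOTerm>
            </HPO>
            <HPOFrequency>
              <Name lang="en">Very frequent (99-80%)</Name>
            </HPOFrequency>
          </HPODisorderAssociation>
        </HPODisorderAssociationList>
      </Disorder>
    </HPODisorderSetStatus>
  </HPODisorderSetStatusList>
</JDBOR>
""".strip()

sample_root = ET.fromstring(sample_xml)

rows = []
for disorder in sample_root.iter('Disorder'):
    disease = disorder.findtext('Name')  # direct child <Name> of the disorder
    for assoc in disorder.iter('HPODisorderAssociation'):
        rows.append((
            disease,
            assoc.findtext('HPO/HPOTerm'),
            assoc.findtext('HPOFrequency/Name'),
            assoc.findtext('HPO/HPOId'),
        ))

sample_pheno_df = pd.DataFrame(rows, columns=['Disease_Name', 'Phenotype', 'Frequency', 'HPO Code'])
```

Looking up by tag name fails loudly (with `None` values) if the layout changes, instead of silently reading the wrong element.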

Note that not all diseases are in the Orphanet data. Map the RARE-X disease names to the Orphanet names, using `None` where no Orphanet entry was found:

In [14]:
rarex_orphanet_mapping = {'Kleefstra syndrome':'Kleefstra syndrome', 'Koolen-de Vries Syndrome':'Koolen-De Vries syndrome',
       'CHD2 related disorders':None, 'FOXP1 Syndrome':'Intellectual disability-severe speech delay-mild dysmorphism syndrome',
       'Classic homocystinuria':'Classic homocystinuria', 'DYRK1A Syndrome':'DYRK1A-related intellectual disability syndrome',
       'CHAMP1 related disorders':None, 'STXBP1 related Disorders':None,
       'Pallister-Killian mosaic syndrome (PKS)':'Tetrasomy 12p',
       'CACNA1A related disorders':None, 'Ring14 and related disorders':None,
       'Wiedemann-Steiner Syndrome (WSS)':'Wiedemann-Steiner syndrome', 'SYNGAP1 related disorders':'SYNGAP1-related developmental and epileptic encephalopathy',
       'CASK-Related Disorders':None, 'NARS1 genetic mutation':None,
       'FAM177A1 Associated Disorder':None, 'Malan Syndrome':'Malan overgrowth syndrome',
       'SETBP1-related disorder':None, 'CHOPS Syndrome':'Cognitive impairment-coarse facies-heart defects-obesity-pulmonary involvement-short stature-skeletal dysplasia syndrome',
       'HUWE1-related disorders':None, '4H Leukodystrophy':'4H leukodystrophy',
       '8p-related disorders':'8p inverted duplication/deletion syndrome', 'Ogden Syndrome (NAA10)':'Ogden syndrome',
       'AHC (Alternating Hemiplegia of Childhood)':'Alternating hemiplegia of childhood',
       'MSL3 -related disorders':None, 'KDM5C-related disorders':'KDM5C-related syndromic X-linked intellectual disability',
       'ARHGEF9-related disorders':None}
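
Because the mapping is typed by hand and later matched exactly, a misspelled or differently capitalized Orphanet name would silently drop a disease from the comparison. The sketch below shows one way to check for that on toy data; `toy_mapping` and `toy_orphanet_names` are made-up stand-ins. In the notebook, the set of names would come from `pheno_df['Disease_Name'].unique()`.

```python
# Toy stand-ins for the hand-written mapping and the names parsed from Orphanet.
toy_mapping = {
    'Disease A': 'Disease a syndrome',
    'Disease B': None,                  # no Orphanet entry
    'Disease C': 'Misspelled name',     # typo: not present in Orphanet
}
toy_orphanet_names = {'Disease a syndrome', 'Other syndrome'}

# Mapped names that do not appear in the Orphanet names exactly as written.
unmatched = {
    rx_name: orpha_name
    for rx_name, orpha_name in toy_mapping.items()
    if orpha_name is not None and orpha_name not in toy_orphanet_names
}
```

An empty `unmatched` dictionary means every mapped name will be found by the exact-match filter used later.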

In [15]:
rx_in_orphanet = [k for k,v in rarex_orphanet_mapping.items() if v]

In [16]:
print(f'Diseases in RARE-X data: {len(rarex_orphanet_mapping)} \nDiseases also in Orphanet: {len(rx_in_orphanet)}\n')
print(f'Diseases missing in Orphanet: {[k for k,v in rarex_orphanet_mapping.items() if not v]}')

Diseases in RARE-X data: 27 
Diseases also in Orphanet: 15

Diseases missing in Orphanet: ['CHD2 related disorders', 'CHAMP1 related disorders', 'STXBP1 related Disorders', 'CACNA1A related disorders', 'Ring14 and related disorders', 'CASK-Related Disorders', 'NARS1 genetic mutation', 'FAM177A1 Associated Disorder', 'SETBP1-related disorder', 'HUWE1-related disorders', 'MSL3 -related disorders', 'ARHGEF9-related disorders']


Limit the analysis to look at very frequent phenotypes for the selected diseases:

In [17]:
orpha_subset = pheno_df.loc[pheno_df['Disease_Name'].isin(list(rarex_orphanet_mapping.values())) & (pheno_df['Frequency'] == 'Very frequent (99-80%)'),:].copy()

In [18]:
orpha_subset.loc[orpha_subset['Disease_Name'] == 'Kleefstra syndrome'].to_csv('Kleefstra_Orphanet.csv',index=False)

In [19]:
orpha_subset.loc[orpha_subset['Disease_Name'] == 'DYRK1A-related intellectual disability syndrome'].to_csv('DYRK1A_Orphanet.csv',index=False)

In [20]:
df.loc[df.Disease_Name=='DYRK1A Syndrome',:].drop(columns='Disease_Name').agg(np.mean).sort_values(ascending=False).to_csv('DYRK1A_RX.csv')
df.loc[df.Disease_Name=='Kleefstra syndrome',:].drop(columns='Disease_Name').agg(np.mean).sort_values(ascending=False).to_csv('Kleefstra_RX.csv')

### Look up parent HPO phenotypes 
Download ontology file `hp.json` from https://hpo.jax.org/app/data/ontology


In [21]:
with open('hp.json', 'r') as f:
    hpo = json.load(f)

Look up a few levels in the phenotype hierarchy to get some broad categories

In [22]:
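def hpo_parent(code, code_col='HPO Code'):
    """Return the first is_a parent of the HPO term in code[code_col] as (code, label)."""
    number = code[code_col].split(':')[1]
    sub_str = 'http://purl.obolibrary.org/obo/HP_' + number
    graph = hpo['graphs'][0]
    # First is_a edge whose subject is this term; None if the term has no parent
    obj = next((edge['obj'] for edge in graph['edges']
                if edge['sub'] == sub_str and edge['pred'] == 'is_a'), None)
    # Label of the parent node; None if the node is not in the ontology file
    lbl = next((node['lbl'] for node in graph['nodes'] if node['id'] == obj), None)
    if obj is None or lbl is None:
        # Same sentinel as before, so the next level up also resolves to "NOT FOUND"
        return pd.Series(['HP:' + ' NOT FOUND', ''])
    return pd.Series(['HP:' + obj.split('HP_')[1], lbl])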
    
    

In [23]:
orpha_subset[['L-1 HPO Code', 'L-1 Phenotype']]= orpha_subset.apply(hpo_parent,axis=1)

In [24]:
orpha_subset[['L-2 HPO Code', 'L-2 Phenotype']]= orpha_subset.apply(lambda x: hpo_parent(x,code_col='L-1 HPO Code'),axis=1)
orpha_subset[['L-3 HPO Code', 'L-3 Phenotype']]= orpha_subset.apply(lambda x: hpo_parent(x,code_col='L-2 HPO Code'),axis=1)
orpha_subset[['L-4 HPO Code', 'L-4 Phenotype']]= orpha_subset.apply(lambda x: hpo_parent(x,code_col='L-3 HPO Code'),axis=1)
orpha_subset[['L-5 HPO Code', 'L-5 Phenotype']]= orpha_subset.apply(lambda x: hpo_parent(x,code_col='L-4 HPO Code'),axis=1)
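
Scanning the full edge list for every row and every level can be slow on the real ontology. One alternative is to build a child-to-parent dictionary once and then walk up it. The sketch below assumes the same JSON shape used above (`graphs[0]['edges']` with `sub`/`pred`/`obj`, `graphs[0]['nodes']` with `id`/`lbl`) and runs on a made-up three-term graph; the IDs, labels, and the `ancestors` helper are hypothetical.

```python
PREFIX = 'http://purl.obolibrary.org/obo/HP_'

# Made-up ontology fragment: 9999903 is_a 9999902 is_a 9999901.
toy_hpo = {'graphs': [{
    'nodes': [
        {'id': PREFIX + '9999901', 'lbl': 'Root term'},
        {'id': PREFIX + '9999902', 'lbl': 'Middle term'},
        {'id': PREFIX + '9999903', 'lbl': 'Leaf term'},
    ],
    'edges': [
        {'sub': PREFIX + '9999903', 'pred': 'is_a', 'obj': PREFIX + '9999902'},
        {'sub': PREFIX + '9999902', 'pred': 'is_a', 'obj': PREFIX + '9999901'},
    ],
}]}

graph = toy_hpo['graphs'][0]
labels = {node['id']: node.get('lbl', '') for node in graph['nodes']}

# Keep only the first is_a parent per term, mirroring the lookup used above.
first_parent = {}
for edge in graph['edges']:
    if edge['pred'] == 'is_a':
        first_parent.setdefault(edge['sub'], edge['obj'])

def ancestors(code, levels=5):
    """Return up to `levels` (HPO code, label) pairs, walking up first is_a parents."""
    uri = PREFIX + code.split(':')[1]
    chain = []
    for _ in range(levels):
        uri = first_parent.get(uri)
        if uri is None:
            break  # reached a term with no recorded parent
        chain.append(('HP:' + uri.split('HP_')[1], labels.get(uri, '')))
    return chain

leaf_chain = ancestors('HP:9999903')
```

Like the lookup above, this follows only one parent per term, so terms with several parents are represented by whichever `is_a` edge appears first in the file.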

We're less interested in facial or morphology attributes that are common in the Orphanet data, so remove them from the list:

In [25]:
orpha_subset = orpha_subset.loc[~(orpha_subset['L-5 Phenotype'].str.contains('morphology|face') | \
                                  orpha_subset['L-4 Phenotype'].str.contains('morphology|face') | \
                                  orpha_subset['L-3 Phenotype'].str.contains('morphology|face') | \
                                  orpha_subset['L-2 Phenotype'].str.contains('morphology|face') | \
                                  orpha_subset['L-1 Phenotype'].str.contains('morphology|face'))]
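
The same filter can be written as a loop over the level columns, which is easier to extend if more levels are added. The sketch below uses a toy frame with two levels; the rows are illustrative only. Passing `na=False` is an addition that treats missing labels as non-matches.

```python
import pandas as pd

# Toy stand-in for orpha_subset with two ancestor levels (illustrative labels).
toy = pd.DataFrame({
    'Phenotype': ['Ptosis', 'Hypotonia', 'Seizure'],
    'L-1 Phenotype': ['Abnormal eyelid morphology', 'Abnormal muscle tone', ''],
    'L-2 Phenotype': ['Abnormality of the face', 'Abnormal muscle physiology', None],
})
level_cols = ['L-1 Phenotype', 'L-2 Phenotype']

# Flag a row if any ancestor label mentions morphology or face.
is_morphology = pd.Series(False, index=toy.index)
for col in level_cols:
    is_morphology |= toy[col].str.contains('morphology|face', na=False)

toy_kept = toy.loc[~is_morphology]
```

Note that the pattern is a plain substring match, so `face` would also match a label containing a word such as "surface".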

Create separate tables for diseases without Orphanet entries, and a table with the RARE-X and Orphanet entries merged.  This will save several Excel files that could be useful for manual inspection.

In [26]:
rx_only_list = []
rx_orphanet_list = []
for k,v in rarex_orphanet_mapping.items():
    with pd.ExcelWriter(f"{k}.xlsx") as writer:
        rx_very_frequent.loc[rx_very_frequent.index==k,:].to_excel(writer, sheet_name="RARE-X", index=False)
        if v:
            orpha_subset.loc[orpha_subset['Disease_Name']==v,:].to_excel(writer, sheet_name="Orphanet", index=False)
            print(f'{k}, n Symptoms: {rx_very_frequent.loc[rx_very_frequent.index==k,:].shape[0]}, Orphanet: {orpha_subset.loc[orpha_subset["Disease_Name"]==v,:].shape[0]}')
            temp = orpha_subset.loc[orpha_subset['Disease_Name']==v,['Phenotype','Frequency']]
            temp['RX Disease_Name'] = k
            temp.set_index('RX Disease_Name',inplace=True)
            rx_orphanet_list.append(pd.concat([rx_very_frequent.loc[rx_very_frequent.index==k,:],temp]))

        else:
            print(f'{k}, n Symptoms: {rx_very_frequent.loc[rx_very_frequent.index==k,:].shape[0]}, Not in Orphanet')
            rx_only_list.append(rx_very_frequent.loc[rx_very_frequent.index==k,:])
rx_only = pd.concat(rx_only_list)
rx_orphanet = pd.concat(rx_orphanet_list)


Kleefstra syndrome, n Symptoms: 4, Orphanet: 4
Koolen-de Vries Syndrome, n Symptoms: 4, Orphanet: 5
CHD2 related disorders, n Symptoms: 3, Not in Orphanet
FOXP1 Syndrome, n Symptoms: 2, Orphanet: 3
Classic homocystinuria, n Symptoms: 0, Orphanet: 4
DYRK1A Syndrome, n Symptoms: 2, Orphanet: 5
CHAMP1 related disorders, n Symptoms: 8, Not in Orphanet
STXBP1 related Disorders, n Symptoms: 3, Not in Orphanet
Pallister-Killian mosaic syndrome (PKS), n Symptoms: 4, Orphanet: 10
CACNA1A related disorders, n Symptoms: 2, Not in Orphanet
Ring14 and related disorders, n Symptoms: 3, Not in Orphanet
Wiedemann-Steiner Syndrome (WSS), n Symptoms: 2, Orphanet: 1
SYNGAP1 related disorders, n Symptoms: 8, Orphanet: 5
CASK-Related Disorders, n Symptoms: 3, Not in Orphanet
NARS1 genetic mutation, n Symptoms: 0, Not in Orphanet
FAM177A1 Associated Disorder, n Symptoms: 0, Not in Orphanet
Malan Syndrome, n Symptoms: 5, Orphanet: 1
SETBP1-related disorder, n Symptoms: 2, Not in Orphanet
CHOPS Syndrome, n Sy

# Step 3: Analyze and make Conclusions

### RARE-X only diseases

The following diseases without Orphanet entries show that coordination issues, hypotonia, and cognitive/intellectual disability are common. Ring14 shows GI issues, and CHAMP1 shows that behavioral and sleep issues are very frequent.

In [27]:
rx_only.sort_index()

Disease_Name,Symptom,Frequency
CACNA1A related disorders,Coordination_Issues_Symptom_Present,0.916667
CACNA1A related disorders,Hypotonia_Symptom_Present,0.891892
CASK-Related Disorders,Cognitive_Impairment_Symptom_Present,0.833333
CASK-Related Disorders,Coordination_Issues_Symptom_Present,0.916667
CASK-Related Disorders,PELHS Microcephaly (<5th percentile),0.875
CHAMP1 related disorders,PELHS Sleep disorder,0.8
CHAMP1 related disorders,PELHS Migraines,0.8
CHAMP1 related disorders,PELHS Intellectual disability,0.8
CHAMP1 related disorders,PELHS Autism/PDD,1.0
CHAMP1 related disorders,PELHS Anxiety,0.8


### Look at diseases with an Orphanet entry

Examine each disease in both datasets. RARE-X symptoms are in the `Symptom` column and Orphanet phenotypes are in the `Phenotype` column.

Kleefstra shows agreement for Hypotonia and Cognitive/Intellectual disabilities:

In [28]:
rx_orphanet.loc[rx_orphanet.index==rx_in_orphanet[0]]

Unnamed: 0,Symptom,Frequency,Phenotype
Kleefstra syndrome,Cognitive_Impairment_Symptom_Present,0.888889,
Kleefstra syndrome,Coordination_Issues_Symptom_Present,0.833333,
Kleefstra syndrome,Hypotonia_Symptom_Present,0.894737,
Kleefstra syndrome,PELHS Intellectual disability,0.818182,
Kleefstra syndrome,,Very frequent (99-80%),Delayed speech and language development
Kleefstra syndrome,,Very frequent (99-80%),Hypotonia
Kleefstra syndrome,,Very frequent (99-80%),Global developmental delay
Kleefstra syndrome,,Very frequent (99-80%),"Intellectual disability, severe"


Koolen-de Vries shows agreement for Hypotonia and Cognitive/Intellectual disabilities, but the lists differ on EEG/Seizures (RARE-X only) and Overfriendliness (Orphanet only).

In [29]:
rx_orphanet.loc[rx_orphanet.index==rx_in_orphanet[1]]

Unnamed: 0,Symptom,Frequency,Phenotype
Koolen-de Vries Syndrome,Cognitive_Impairment_Symptom_Present,0.833333,
Koolen-de Vries Syndrome,Abnormal_EEG_Symptom_Present,1.0,
Koolen-de Vries Syndrome,Hypotonia_Symptom_Present,1.0,
Koolen-de Vries Syndrome,Seizures_Symptom_Present,1.0,
Koolen-de Vries Syndrome,,Very frequent (99-80%),Ptosis
Koolen-de Vries Syndrome,,Very frequent (99-80%),Intellectual disability
Koolen-de Vries Syndrome,,Very frequent (99-80%),Hypotonia
Koolen-de Vries Syndrome,,Very frequent (99-80%),Global developmental delay
Koolen-de Vries Syndrome,,Very frequent (99-80%),Overfriendliness


FOXP1 agrees on cognitive impairment, but RARE-X also includes Coordination issues.

In [30]:
rx_orphanet.loc[rx_orphanet.index==rx_in_orphanet[2]]

Unnamed: 0,Symptom,Frequency,Phenotype
FOXP1 Syndrome,Cognitive_Impairment_Symptom_Present,0.875,
FOXP1 Syndrome,Coordination_Issues_Symptom_Present,0.88,
FOXP1 Syndrome,,Very frequent (99-80%),Delayed speech and language development
FOXP1 Syndrome,,Very frequent (99-80%),Expressive language delay
FOXP1 Syndrome,,Very frequent (99-80%),Speech articulation difficulties


Classic homocystinuria doesn't have any RARE-X symptoms that met the criteria I used.

In [31]:
rx_orphanet.loc[rx_orphanet.index==rx_in_orphanet[3]]

Unnamed: 0,Symptom,Frequency,Phenotype
Classic homocystinuria,,Very frequent (99-80%),Intellectual disability
Classic homocystinuria,,Very frequent (99-80%),Disproportionate tall stature
Classic homocystinuria,,Very frequent (99-80%),Recurrent fractures
Classic homocystinuria,,Very frequent (99-80%),Abnormality of amino acid metabolism


DYRK1A Syndrome has ASD symptoms and Microcephaly from RARE-X and developmental delays from Orphanet.

In [32]:
rx_orphanet.loc[rx_orphanet.index==rx_in_orphanet[4]]

Unnamed: 0,Symptom,Frequency,Phenotype
DYRK1A Syndrome,ASD_Symptom_Present,0.857143,
DYRK1A Syndrome,PELHS Microcephaly (<5th percentile),0.875,
DYRK1A Syndrome,,Very frequent (99-80%),Delayed speech and language development
DYRK1A Syndrome,,Very frequent (99-80%),Intellectual disability
DYRK1A Syndrome,,Very frequent (99-80%),Global developmental delay
DYRK1A Syndrome,,Very frequent (99-80%),Gait disturbance
DYRK1A Syndrome,,Very frequent (99-80%),Feeding difficulties


PKS has agreement for Hypotonia and Cognitive/Intellectual disabilities. RARE-X data flags EEG and Seizure activity, while Orphanet has several phenotypes related to growth and the skeletal system.

In [33]:
rx_orphanet.loc[rx_orphanet.index==rx_in_orphanet[5]]

Unnamed: 0,Symptom,Frequency,Phenotype
Pallister-Killian mosaic syndrome (PKS),Cognitive_Impairment_Symptom_Present,1.0,
Pallister-Killian mosaic syndrome (PKS),Abnormal_EEG_Symptom_Present,1.0,
Pallister-Killian mosaic syndrome (PKS),Hypotonia_Symptom_Present,1.0,
Pallister-Killian mosaic syndrome (PKS),Seizures_Symptom_Present,0.833333,
Pallister-Killian mosaic syndrome (PKS),,Very frequent (99-80%),Ptosis
Pallister-Killian mosaic syndrome (PKS),,Very frequent (99-80%),Sparse eyebrow
Pallister-Killian mosaic syndrome (PKS),,Very frequent (99-80%),Hypohidrosis
Pallister-Killian mosaic syndrome (PKS),,Very frequent (99-80%),Hypotonia
Pallister-Killian mosaic syndrome (PKS),,Very frequent (99-80%),Reduced tendon reflexes
Pallister-Killian mosaic syndrome (PKS),,Very frequent (99-80%),Delayed skeletal maturation


WSS showed Hypotonia and Short stature as very frequent in the RARE-X data, while only delayed speech was noted as very frequent in the Orphanet data

In [34]:
rx_orphanet.loc[rx_orphanet.index==rx_in_orphanet[6]]

Unnamed: 0,Symptom,Frequency,Phenotype
Wiedemann-Steiner Syndrome (WSS),Short_Stature_Symptom_Present,0.843137,
Wiedemann-Steiner Syndrome (WSS),Hypotonia_Symptom_Present,0.808511,
Wiedemann-Steiner Syndrome (WSS),,Very frequent (99-80%),Delayed speech and language development


SYNGAP1 flagged Behavioral symptoms and Hypotonia in the RARE-X data, while Orphanet noted abnormal pain sensation. Both agreed on Intellectual disability and Seizures.

In [35]:
rx_orphanet.loc[rx_orphanet.index==rx_in_orphanet[7]]

Unnamed: 0,Symptom,Frequency,Phenotype
SYNGAP1 related disorders,ASD_Symptom_Present,0.846154,
SYNGAP1 related disorders,Temper_Tantrums_Symptom_Present,0.807692,
SYNGAP1 related disorders,Cognitive_Impairment_Symptom_Present,1.0,
SYNGAP1 related disorders,Coordination_Issues_Symptom_Present,0.863636,
SYNGAP1 related disorders,Abnormal_EEG_Symptom_Present,0.909091,
SYNGAP1 related disorders,Hypotonia_Symptom_Present,0.909091,
SYNGAP1 related disorders,Seizures_Symptom_Present,0.9,
SYNGAP1 related disorders,PELHS Intellectual disability,0.857143,
SYNGAP1 related disorders,,Very frequent (99-80%),Delayed speech and language development
SYNGAP1 related disorders,,Very frequent (99-80%),Intellectual disability


Malan Syndrome noted Hypotonia and Cognitive Symptoms in the RARE-X data and Orphanet corroborated overgrowth

In [36]:
rx_orphanet.loc[rx_orphanet.index==rx_in_orphanet[8]]

Unnamed: 0,Symptom,Frequency,Phenotype
Malan Syndrome,General_Overgrowth_Symptom_Present,0.814815,
Malan Syndrome,Cognitive_Impairment_Symptom_Present,0.923077,
Malan Syndrome,Coordination_Issues_Symptom_Present,0.814815,
Malan Syndrome,Hypotonia_Symptom_Present,0.96,
Malan Syndrome,PELHS Macrocephaly (>95th percentile),0.8125,
Malan Syndrome,,Very frequent (99-80%),Accelerated skeletal maturation


CHOPS did not have any RARE-X symptoms that met the criteria used in this analysis

In [37]:
rx_orphanet.loc[rx_orphanet.index==rx_in_orphanet[9]]

Unnamed: 0,Symptom,Frequency,Phenotype
CHOPS Syndrome,,Very frequent (99-80%),Intellectual disability
CHOPS Syndrome,,Very frequent (99-80%),Global developmental delay
CHOPS Syndrome,,Very frequent (99-80%),Obesity
CHOPS Syndrome,,Very frequent (99-80%),Abnormality of the respiratory system
CHOPS Syndrome,,Very frequent (99-80%),Short stature
CHOPS Syndrome,,Very frequent (99-80%),Abnormality of skeletal morphology


4H was associated with Cognitive Impairment in the RARE-X data, and agreed with the mobility symptoms from Orphanet

In [38]:
rx_orphanet.loc[rx_orphanet.index==rx_in_orphanet[10]]

Unnamed: 0,Symptom,Frequency,Phenotype
4H Leukodystrophy,Cognitive_Impairment_Symptom_Present,1.0,
4H Leukodystrophy,Coordination_Issues_Symptom_Present,1.0,
4H Leukodystrophy,,Very frequent (99-80%),Hypogonadotrophic hypogonadism
4H Leukodystrophy,,Very frequent (99-80%),Myopia
4H Leukodystrophy,,Very frequent (99-80%),Ataxia
4H Leukodystrophy,,Very frequent (99-80%),Dysarthria
4H Leukodystrophy,,Very frequent (99-80%),Dystonia


8p RARE-X data agreed with the abnormal muscle function phenotypes in Orphanet

In [39]:
rx_orphanet.loc[rx_orphanet.index==rx_in_orphanet[11]]

Unnamed: 0,Symptom,Frequency,Phenotype
8p-related disorders,Abnormal_Muscle_Function_Symptom_Present,1.0,
8p-related disorders,,Very frequent (99-80%),Delayed speech and language development
8p-related disorders,,Very frequent (99-80%),Global developmental delay
8p-related disorders,,Very frequent (99-80%),Intellectual disability
8p-related disorders,,Very frequent (99-80%),"Intellectual disability, mild"
8p-related disorders,,Very frequent (99-80%),Hypertonia
8p-related disorders,,Very frequent (99-80%),Spastic tetraplegia
8p-related disorders,,Very frequent (99-80%),Abnormality of chromosome segregation
8p-related disorders,,Very frequent (99-80%),Infantile muscular hypotonia
8p-related disorders,,Very frequent (99-80%),"Intellectual disability, severe"


For Ogden, RARE-X highlights movement and cognitive/intellectual symptoms; no Orphanet phenotypes met the filtering criteria used here.

In [40]:
rx_orphanet.loc[rx_orphanet.index==rx_in_orphanet[12]]

Unnamed: 0,Symptom,Frequency,Phenotype
Ogden Syndrome (NAA10),Abnormal_Muscle_Function_Symptom_Present,1.0,
Ogden Syndrome (NAA10),Cognitive_Impairment_Symptom_Present,1.0,
Ogden Syndrome (NAA10),Coordination_Issues_Symptom_Present,1.0,
Ogden Syndrome (NAA10),Hypotonia_Symptom_Present,0.875,
Ogden Syndrome (NAA10),PELHS Intellectual disability,0.857143,


AHC has symptoms related to muscle strength, coordination, and cognition in the RARE-X data, while GI symptoms are noted in Orphanet

In [41]:
rx_orphanet.loc[rx_orphanet.index==rx_in_orphanet[13]]

Unnamed: 0,Symptom,Frequency,Phenotype
AHC (Alternating Hemiplegia of Childhood),Cognitive_Impairment_Symptom_Present,0.8,
AHC (Alternating Hemiplegia of Childhood),Coordination_Issues_Symptom_Present,0.928571,
AHC (Alternating Hemiplegia of Childhood),Hypotonia_Symptom_Present,0.866667,
AHC (Alternating Hemiplegia of Childhood),,Very frequent (99-80%),Gastrointestinal dysmotility
AHC (Alternating Hemiplegia of Childhood),,Very frequent (99-80%),Abnormality of the gastrointestinal tract
AHC (Alternating Hemiplegia of Childhood),,Very frequent (99-80%),Episodic hemiplegia


KDM5C has behavioral symptoms identified in the RARE-X data, and Orphanet agrees with the cognitive impairment finding.

In [42]:
rx_orphanet.loc[rx_orphanet.index==rx_in_orphanet[14]]

Unnamed: 0,Symptom,Frequency,Phenotype
KDM5C-related disorders,Anxiety_Symptom_Present,0.833333,
KDM5C-related disorders,Impulsivity_Symptom_Present,0.857143,
KDM5C-related disorders,Cognitive_Impairment_Symptom_Present,0.8,
KDM5C-related disorders,,Very frequent (99-80%),Delayed speech and language development
KDM5C-related disorders,,Very frequent (99-80%),Alopecia areata
KDM5C-related disorders,,Very frequent (99-80%),"Intellectual disability, severe"


## Final Thoughts

In this analysis, the common symptoms from RARE-X have been analyzed and compared to Orphanet. In general, Hypotonia and Seizure symptoms were more commonly identified as very frequent in the RARE-X data than in Orphanet. Twelve of the diseases didn't have a distinct Orphanet entry. With the (arbitrary) minimum-case constraint used, the following diseases were excluded:

 - ARHGEF9-related disorders
 - CHOPS Syndrome
 - FAM177A1 Associated Disorder
 - NARS1 genetic mutation


The code is available to be run on new sets of similar/future data from RARE-X, and we compared the findings of very frequent symptoms to the Orphanet phenotypes.


### How can you improve this?

This was a basic analysis of the RARE-X data using one other data source to confirm the survey findings.  Some ideas where your work could show improvement:  

 - *Better Statistics* I used a simple frequency metric in this solution with an arbitrary minimum number of cases.  The missing data was simply ignored here.  Consider statistics around sample sizes and population frequencies to identify more significant phenotypes.
 - *More Background* I used some data from Orphanet, but not all of the diseases were covered there and not all phenotypes are covered.  Other literature sources would be great!
 - *Broader scope* Only "Very Frequent" characteristics were considered here, and some morphology features were dropped from Orphanet to reduce the number of phenotypes. Including others would be even better.
 - *Something else* Bring your creativity to identify new types of analysis and sources to include!
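
As one possible starting point for the *Better Statistics* idea, a confidence interval shows how much a frequency estimated from a few cases can be trusted. The sketch below uses the Wilson score interval, which was not part of this analysis; `wilson_interval` is a hypothetical helper and the default `z=1.96` (roughly 95% coverage) is an assumption.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion; z=1.96 gives roughly 95% coverage."""
    if n == 0:
        return (float('nan'), float('nan'))
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return (center - half_width, center + half_width)

# Example: a symptom reported by 8 of 10 respondents.
low, high = wilson_interval(8, 10)
```

For 8 of 10 respondents the interval runs from roughly 0.49 to 0.94, so a symptom that just clears the 0.8 cutoff with ten cases could plausibly sit well below it in the wider population.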