# Exploring the first release of the PanelApp data


Made available on: 2020.10.01


**Conclusions:**

* Total number of associations: ~60k
* Number of significant associations: ~42k (panel version >= 1.0 and green or amber)
* ~1k of the significant associations are missing phenotypes
* ~48k phenotype terms in total (~1.2 phenotypes per association)
* ~27k unique phenotype terms
* Out of 100 randomly selected phenotypes, 96 could be mapped, 13 of them with an exact match.

In [7]:
import json 
import pandas as pd
import gzip

panelapp_file = '/Users/dsuveges/project/evidences/All_genes_20200928-1959.tsv'

panelapp_df = pd.read_csv(panelapp_file, sep='\t')
print(f'number of associations: {len(panelapp_df)}')

# The data needs to be filtered by the following criteria:
# 1. panel version >= 1.0
# 2. association is green or amber

panelapp_signif = panelapp_df.loc[(panelapp_df['Panel Version'] >= 1) &
                                  (panelapp_df['List'].isin(['green', 'amber']))]
print(f'number of significant associations: {len(panelapp_signif)}')
print(f'number of genes in the signif set: {len(panelapp_signif["EnsemblId(GRch38)"].unique())}')


number of associations: 60772
number of significant associations: 42371
number of genes in the signif set: 4002
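
To see how a filtered set splits between the two confidence levels, the same filter can be followed by a count per `List` value. The sketch below is a minimal illustration on a small made-up table: the rows and gene symbols are hypothetical, and only the column names `Panel Version` and `List` are taken from the file above.

```python
import pandas as pd

# Hypothetical toy rows; only the column names mirror the PanelApp export.
toy_df = pd.DataFrame({
    'Symbol': ['GENE1', 'GENE2', 'GENE3', 'GENE4', 'GENE5'],
    'Panel Version': [1.0, 0.9, 2.3, 1.5, 1.1],
    'List': ['green', 'green', 'amber', 'red', 'green'],
})

# Same criteria as in the notebook: panel version >= 1.0 and green or amber.
toy_signif = toy_df.loc[
    (toy_df['Panel Version'] >= 1)
    & toy_df['List'].isin(['green', 'amber'])
]

# Number of retained associations per confidence level.
per_list = toy_signif['List'].value_counts()
print(per_list)
```

Running the same `value_counts()` call on `panelapp_signif` would give the green/amber breakdown of the real data.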


### What about the phenotypes?

In [21]:
# Significant associations with no phenotypes:
print(f'Missing phenotypes annotation: {len(panelapp_signif.loc[panelapp_signif.Phenotypes.isna()])}')


Missing phenotypes annotation: 1072


In [29]:
# Container for the individual phenotype terms:
phenotypes = []

# Select the unique, non-missing phenotype annotations (from all associations):
phenotype_series = pd.Series(
    panelapp_df
    .loc[~panelapp_df.Phenotypes.isna()]
    .Phenotypes
    .unique()
)

# Each annotation may hold several terms separated by ';'. Split and collect them:
for annotation in phenotype_series:
    phenotypes.extend(annotation.split(';'))

print(f'Total number of phenotype terms: {len(phenotypes)}')
print(f'number of unique phenotype terms: {len(set(phenotypes))}')

Total number of phenotype terms: 48446
number of unique phenotype terms: 27173
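
The split above keeps each term exactly as it appears between the `;` separators, so variants that differ only in surrounding whitespace are counted as separate terms. Below is a minimal sketch of a whitespace-normalising variant, shown on hypothetical annotation strings. Whether it would change the counts reported above was not checked here.

```python
from collections import Counter

# Hypothetical annotation strings in the same ';'-separated layout.
annotations = [
    'Coronary artery disease; EEG abnormality',
    'EEG abnormality;Global developmental delay',
    'Coronary artery disease',
]

# Split on ';', trim surrounding whitespace and drop empty fragments.
terms = [
    term.strip()
    for annotation in annotations
    for term in annotation.split(';')
    if term.strip()
]

# Count how often each normalised term occurs.
term_counts = Counter(terms)
print(f'Total number of phenotype terms: {len(terms)}')
print(f'number of unique phenotype terms: {len(term_counts)}')
```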


In [33]:
from ontoma import OnToma
import random


otmap = OnToma()

INFO     - ontoma.downloaders - ZOOMA to EFO mappings - Parsed 3666 rows
INFO     - ontoma.downloaders - OMIM to EFO mappings - Parsed 8215 rows


In [54]:
mappings = []
for phenotype in random.sample(phenotypes, 100):
    mappings.append(otmap.find_term(phenotype, verbose=True))

# Keep only the terms for which a mapping was found:
mappings_df = pd.DataFrame([x for x in mappings if x is not None])
print(f'mapping found for {len(mappings_df)} terms.')
print(f'exact match found for {len(mappings_df.loc[mappings_df.quality == "match"])}')

INFO     - ontoma.interface - Found http://www.orpha.net/ORDO/Orphanet_183660 for SCID from OLS API EFO lookup - match - None
ERROR    - ontoma.interface - Could not find *any* term for string: Neurodevelopmental disorder with absent language and variable seizures, 618707
ERROR    - ontoma.interface - Could not find *any* term for string: Defects in intrinsic and innate immunity
INFO     - ontoma.interface - Found http://www.ebi.ac.uk/efo/EFO_0000378 for Coronary artery disease from EFO OBO - match - None
INFO     - ontoma.interface - Found http://purl.obolibrary.org/obo/HP_0002353 for EEG abnormality from Zooma API lookup - match - None
INFO     - ontoma.interface - Found http://www.orpha.net/ORDO/Orphanet_68380 for Mitochondrial Diseases from OT Zooma Mappings - match - None
ERROR    - ontoma.interface - Could not find *any* term for string: ?Immunodeficiency 37, 616098
INFO     - ontoma.interface - Found http://www.orpha.net/ORDO/Orphanet_313838 for CEREBRORETINAL MICROANGIOPATHY WI

INFO     - ontoma.interface - Found http://purl.obolibrary.org/obo/HP_0001263 for Global developmental delay from Zooma API lookup - match - None
ERROR    - ontoma.interface - Could not find *any* term for string: Erythremias, beta-
INFO     - ontoma.interface - Found http://www.orpha.net/ORDO/Orphanet_79401 for Epidermolysis Bullosa Simplex, Ogna Type from OT Zooma Mappings - match - None
INFO     - ontoma.interface - Found http://www.orpha.net/ORDO/Orphanet_325004 for CANDLE syndrome from OLS API EFO lookup - match - None
INFO     - ontoma.interface - Found http://www.orpha.net/ORDO/Orphanet_1775 for Dyskeratosis congenita, autosomal recessive 4 from OT Zooma Mappings - match - None
INFO     - ontoma.interface - Found http://purl.obolibrary.org/obo/HP_0001760 for Abnormality of the foot from Zooma API lookup - match - None
INFO     - ontoma.interface - Found http://www.orpha.net/ORDO/Orphanet_2396 for Encephalocraniocutaneous lipomatosis from EFO OBO - match - None
INFO     - ontoma.



mapping found for 96 terms.
exact match found for 13
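
Several of the unmapped strings in the log carry a trailing six-digit number (apparently an OMIM identifier) or a leading `?`. One option, not tried in this notebook, would be to separate these parts from the label before the lookup. In the sketch below, `split_phenotype` is a hypothetical helper, and its pattern is an assumption based only on the examples visible in the log.

```python
import re

# Assumed layout: optional leading '?', a label, then optionally
# a comma and a six-digit number at the very end of the string.
PHENOTYPE_PATTERN = re.compile(
    r'^\??\s*(?P<label>.*?)(?:,?\s*(?P<number>\d{6}))?\s*$'
)


def split_phenotype(term):
    """Return (label, number); number is None without a trailing six-digit ID."""
    match = PHENOTYPE_PATTERN.match(term)
    if match is None:
        # Strings that do not fit the assumed layout are returned unchanged.
        return term, None
    return match.group('label').strip(), match.group('number')


for raw in ['?Immunodeficiency 37, 616098',
            'Coronary artery disease',
            'Erythremias, beta-']:
    print(split_phenotype(raw))
```

The cleaned label could then be passed to `otmap.find_term`, with the number kept aside for a separate lookup. Whether this improves the mapping rate would need to be tested.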


In [56]:
panelapp_df.columns

Index(['Symbol', 'Panel Id', 'Panel Name', 'Panel Version', 'Panel Status',
       'List', 'Sources', 'Mode of inheritance', 'Mode of pathogenicity',
       'Tags', 'EnsemblId(GRch37)', 'EnsemblId(GRch38)', 'HGNC', 'Biotype',
       'Phenotypes', 'GeneLocation((GRch37)', 'GeneLocation((GRch38)',
       'Panel Types', 'Super Panel Id', 'Super Panel Name',
       'Super Panel Version'],
      dtype='object')