# Explore creation of phenopackets from supplemental material
As an example, we use Platzer K, et al. De Novo Variants in MAPK8IP3 Cause Intellectual Disability with Variable Brain Anomalies. Am J Hum Genet. 2019 Feb 7;104(2):203-212. PMC6369540.

In [1]:
import phenopackets as pp
import pandas as pd
from collections import defaultdict
import os
import sys

sys.path.insert(0, os.path.abspath('../../pyphetools'))
from pyphetools import CohortEncoder, ColumnMapper, HpoParser

Supplemental Table S1 contains detailed clinical information for all individuals with causative de novo variants in MAPK8IP3. We read it from the original Excel file because some of the cells contain newline characters.

In [2]:
df = pd.read_excel('data/mmc2.xlsx')

In [3]:
df.head()

Unnamed: 0,Indvidual\nin\nmanuscript,g.(hg19) Chr16:,Transcript\nNM_015133.4\nc.,p.,origin,genetic testing,Sex,age at last assesment,prenatal period,Exam at birth,...,neurological examination,result of external MRI,seizures,Sz onset and Sz types,AEDs used,Sz outcome,EEG,Additional symptoms,family history,further results of genetic testing
0,1,1756405,c.65delG,p.Gly22Alafs*3,de novo,TrioWES,M,14 y 8 m,,41 weeks:\nlength: 53.3 cm\nweight: 3.941 kg\n...,...,ataxia,"mild cerebellar atrophy, hypointensity of the ...",no,,,,,speech is ataxic but speaks in sentences/short...,unremarkable,
1,2,1756419,c.79G>T,p.Glu27*,de novo,SingleWES,M,4 y,,length: 49 cm\nweigth: 3215 g\nOFC: 35 cm,...,ataxia,normal,no,,,,,pre-natal pelvi-ureteric junction stenosis (sp...,,
2,3,1756451,c.111C>G,p.Tyr37*,de novo,TrioWES,M,4 y,,length: 20.5 in\nweight: 8 lb 2 oz\nOFC: NA,...,,Stable areas of T2 hyperintensity involving th...,no,,,,,Nystagmus,unremarkable,770 kb duplicaion of 20p12.3 on chromosome mic...
3,4,1798706,c.1198G>A,p.Gly400Arg,de novo,TrioWES,M,7 y 6 m,"no prenatal care, no known problems","32 weeks:\nlength: NA,\nweight: 4 lbs,\nOFC: N...",...,,no MRI done,no,,,,,Left hearing loss; Dysmorphic features: hypert...,Mother with learning disorder; finished 11th g...,
4,5,1810410,c.1331T>C,p.Leu444Pro,de novo,TrioWES,M,10 y,,"40 weeks, length: 52 cm\nweight: 3810 g\nOFC:...",...,,perisylvian polymicrogyria,yes,10 y:\none event of a generalized seizure,,,pathological EEG with normal age-related backg...,"no dysmorphism, small teeth, severe s-configur...",,


In [4]:
df.columns


Index(['Indvidual\nin\nmanuscript', 'g.(hg19) Chr16:',
       'Transcript\nNM_015133.4\nc.', 'p.', 'origin', 'genetic testing', 'Sex',
       'age at last assesment', 'prenatal period', 'Exam at birth',
       'body measurements\n(at last assesment if not otherwise specified)',
       'DD', 'severity of ID', 'development', 'regression', 'autism',
       'hypotonia', 'movement disorder', 'CVI', 'neurological examination',
       'result of external MRI', 'seizures', 'Sz onset and Sz types',
       'AEDs used', 'Sz outcome', 'EEG', 'Additional symptoms',
       'family history', 'further results of genetic testing'],
      dtype='object')
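Several of the column names listed above contain embedded newlines, which makes them awkward to type when selecting columns. A minimal sketch of one way to normalize them follows; `normalize_header` is a hypothetical helper, not part of pyphetools, and the header strings are copied from the output above (including the source's own spellings).

```python
def normalize_header(name: str) -> str:
    """Collapse newlines and runs of whitespace in a header into single spaces."""
    return " ".join(str(name).split())


# A few headers as they appear in the supplemental table
raw_headers = [
    "Indvidual\nin\nmanuscript",
    "Transcript\nNM_015133.4\nc.",
    "body measurements\n(at last assesment if not otherwise specified)",
    "Sex",
]

clean_headers = [normalize_header(h) for h in raw_headers]
for raw, clean in zip(raw_headers, clean_headers):
    print(repr(raw), "->", repr(clean))
```

With pandas, `df.rename(columns=normalize_header)` would apply the same function to every column name and return a new DataFrame. Only whitespace is changed, so the names still match the source file otherwise.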

We need to create a dictionary with the HPO terms contained in the descriptions.

In [5]:
hpo_json_path = '/Users/robinp/data/hpo/hp.json'
parser = HpoParser(hpo_json_file=hpo_json_path)

In [6]:
hpo_version = parser.get_version()
primary_label_to_id_d = parser.get_primary_labels_to_id_map()
label_to_primary_label_d = parser.get_label_to_primary_label_map()
print(f"HPO version {hpo_version}")
print(f"primary labels n = {len(primary_label_to_id_d)}")
print(f"total labels {len(label_to_primary_label_d)}")

HPO version 2022-10-05
primary labels n = 17059
total labels 17059


## Labels to term id map
We need to extend OAK so that we can extract all synonyms for each HPO term (excluding abbreviations, a few of which are shared by more than one term). For now, we just use the primary labels.
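Until synonym extraction is available, a label-to-id map that includes synonyms could be sketched directly from the parsed `hp.json`. The code below assumes the obographs JSON layout (`graphs` → `nodes`, each node with `id`, `lbl`, and `meta.synonyms[].val`). `build_label_to_id_map` is a hypothetical helper, and the nodes are toy data in which the synonyms (including the colliding abbreviation "ABC") are made up for illustration.

```python
from collections import defaultdict


def build_label_to_id_map(obograph: dict) -> dict:
    """Map lower-cased labels and synonyms to HPO ids, dropping ambiguous strings."""
    label_to_ids = defaultdict(set)
    for graph in obograph.get("graphs", []):
        for node in graph.get("nodes", []):
            uri = node.get("id", "")
            label = node.get("lbl")
            if "HP_" not in uri or label is None:
                continue  # skip non-HPO nodes and nodes without a label
            # Turn '.../obo/HP_0001251' into 'HP:0001251'
            hpo_id = uri.rsplit("/", 1)[-1].replace("_", ":")
            label_to_ids[label.lower()].add(hpo_id)
            for synonym in node.get("meta", {}).get("synonyms", []):
                label_to_ids[synonym["val"].lower()].add(hpo_id)
    # Keep only strings that point to exactly one term
    return {
        text: next(iter(ids))
        for text, ids in label_to_ids.items()
        if len(ids) == 1
    }


# Toy stand-in for json.load(open(hpo_json_path)); synonyms are illustrative only
toy_obograph = {
    "graphs": [{
        "nodes": [
            {"id": "http://purl.obolibrary.org/obo/HP_0001251", "lbl": "Ataxia",
             "meta": {"synonyms": [{"val": "Cerebellar ataxia"}, {"val": "ABC"}]}},
            {"id": "http://purl.obolibrary.org/obo/HP_0001257", "lbl": "Spasticity",
             "meta": {"synonyms": [{"val": "Muscle spasticity"}, {"val": "ABC"}]}},
        ]
    }]
}

label_to_id_d = build_label_to_id_map(toy_obograph)
print(label_to_id_d)
```

Dropping every string that maps to more than one term is a blunt way to handle the abbreviation collisions mentioned above; a real implementation might instead filter on the synonym type.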

In [8]:
cohort_encoder = CohortEncoder(df=df, id_to_primary_d=id_to_primary_d, label_to_id_d=primary_label_to_id_d)

NameError: name 'id_to_primary_d' is not defined
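The `NameError` above occurs because `id_to_primary_d` is never assigned in an earlier cell. Assuming the name refers to a map from HPO id to primary label (a guess based on the variable name, not something confirmed by pyphetools), it could be derived by inverting `primary_label_to_id_d`. The dictionary below is a small stand-in for the parser's output.

```python
# Stand-in for parser.get_primary_labels_to_id_map()
primary_label_to_id_d = {
    "Ataxia": "HP:0001251",
    "Spasticity": "HP:0001257",
}

# Invert label -> id into id -> label (assumed meaning of id_to_primary_d)
id_to_primary_d = {
    hpo_id: label for label, hpo_id in primary_label_to_id_d.items()
}

# Inversion is lossless only if every id occurs once
assert len(id_to_primary_d) == len(primary_label_to_id_d)
print(id_to_primary_d)
```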

##  ColumnMapper
The idea is to make one ColumnMapper object for each column of interest. The column mapper knows how to map the contents using either
default exact text matching or custom maps from whatever strings to HPO terms.
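To make the idea concrete, here is a minimal sketch of such a mapper in plain Python. `SimpleColumnMapper` is a hypothetical class written for illustration and is not the pyphetools `ColumnMapper`. It assumes that cells are split on `;`, `,`, `/` and newlines before exact matching, and that custom-map keys are searched as substrings of the cell. The label dictionary is toy data; check the ids against `hp.json` before relying on them.

```python
import re


class SimpleColumnMapper:
    """Map free-text cells to (label, HPO id) pairs; illustrative only."""

    def __init__(self, label_to_id_d, custom_map_d=None):
        # Lower-case keys so that matching is case-insensitive
        self._label_to_id = {
            label.lower(): (label, hpo_id)
            for label, hpo_id in label_to_id_d.items()
        }
        self._custom_map = {
            phrase.lower(): label
            for phrase, label in (custom_map_d or {}).items()
        }

    def map_cell(self, cell):
        if not isinstance(cell, str):
            return []  # e.g. NaN for an empty cell
        text = cell.lower()
        hits = []
        # 1) Custom map: look for each phrase anywhere in the cell
        for phrase, label in self._custom_map.items():
            hit = self._label_to_id.get(label.lower())
            if phrase in text and hit and hit not in hits:
                hits.append(hit)
        # 2) Default: exact match of each delimiter-separated chunk
        for chunk in re.split(r"[;,/\n]", text):
            hit = self._label_to_id.get(chunk.strip())
            if hit and hit not in hits:
                hits.append(hit)
        return hits


toy_label_to_id_d = {
    "Ataxia": "HP:0001251",
    "Spasticity": "HP:0001257",
    "Lower limb muscle weakness": "HP:0007340",
}
mapper = SimpleColumnMapper(
    toy_label_to_id_d,
    custom_map_d={"low extremity weakness": "Lower limb muscle weakness"},
)

for cell in ["ataxia", "spasticity/stiff legs",
             "spasticity; Significant low extremity weakness.", float("nan")]:
    print(repr(cell), "->", mapper.map_cell(cell))
```

Exact matching catches short cells such as `ataxia`, while the custom map handles phrases buried in longer free text. Cells that match neither route return an empty list and would need a new custom-map entry.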

In [16]:
df['neurological examination'].unique()

array(['ataxia', nan, 'spastic paraplegia',
       'spasticity; nerve conduction and EMG studies with abnormal findings "remarkable for the failure to activate the leg muscles due to an upper motor neuron pattern of aberrant motor unit potential firing rates. These findings are consistent with dysfunction of the corticospinal pathways rather than a lower motor unit." Significant low extremity weakness.',
       'spasticity/stiff legs', 'spastic diplegic cerebral palsy',
       'orobuccal dyspraxia, awkward gross and fine motricity, difficulty in coordination, unstable gait'],
      dtype=object)

In [9]:
type(df['neurological examination'])

pandas.core.series.Series

In [8]:
neuro_exam_custom_map = {'low extremity weakness': 'Lower limb muscle weakness'}

In [None]:
# Build a mapper for the 'neurological examination' column from the custom map
neuroMapper = ColumnMapper(custom_map_d=neuro_exam_custom_map)
neuroMapper.preview()