CMS (Centers for Medicare & Medicaid Services) has made available a synthetic PUF (Public Use File) of Medicare claims. This data set is known as [SYNPUF](https://www.cms.gov/Research-Statistics-Data-and-Systems/Downloadable-Public-Use-Files/SynPUFs/). It contains generated, fake diagnoses, demographics, and procedures for visits. The data is stored in a terse format that is difficult to work with. It can be transformed with an [ETL script](https://github.com/OHDSI/ETL-CMS/blob/master/python_etl/README.md) into [OHDSI](https://www.ohdsi.org/)'s [Common Data Model](https://github.com/OHDSI/CommonDataModel). The transformed SYNPUF data (version 5) was loaded into a PostgreSQL database server. Using a [mapper script](https://github.com/jhajagos/TransformDBtoHDF5ML), inpatient visits are mapped into separate matrices.

The goal is to build a predictive 30-day inpatient readmission model. To build such a model, the data needs to be processed further. This part of the tutorial explores basic analysis of the synthetic clinical HDF5 container file using [h5py](http://www.h5py.org/) and [NumPy](http://www.numpy.org/).

In [1]:
import h5py # Library for reading HDF5 files

In [2]:
import numpy as np # Numerical matrix library

In [3]:
f5 = h5py.File("synpuf_inpatient_combined_readmission.hdf5", "r")

The `f5` object controls access to the underlying data in the HDF5 container. Matrix containers are accessed using a notation similar to traversing a file system. Here the container is opened in read-only mode.

In [4]:
list(f5["/"])

[u'computed', u'ohdsi']

In [5]:
list(f5["/ohdsi/"])

[u'condition_occurrence',
 u'death',
 u'drug_exposure',
 u'identifiers',
 u'measurement',
 u'observation',
 u'observation_period',
 u'person',
 u'procedure_occurrence',
 u'visit_occurrence']

In [6]:
list(f5["/ohdsi/condition_occurrence/"])

[u'column_annotations', u'column_header', u'core_array']

The object stored at `/ohdsi/condition_occurrence/core_array` is a numeric array. The h5py library emulates the API of NumPy. 

In [7]:
f5["/ohdsi/condition_occurrence/core_array"].shape

(66700, 3559)

Let's start by looking at the data stored in the matrix. Each row represents a separate hospitalization and each column a separate condition. A `1` indicates that the condition is present for that hospitalization.

In [8]:
f5["/ohdsi/condition_occurrence/core_array"][0:10,:]

array([[ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       ..., 
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.]])

In [9]:
np.sum(f5["/ohdsi/condition_occurrence/core_array"][0:10,:])

49.0

The array is sparse: most of its entries are `0`. This is common in health care data, where encoding many possible codes as separate columns leaves most cells empty.

In [10]:
conditions_ap = f5["/ohdsi/condition_occurrence/core_array"]

In [11]:
np.sum(conditions_ap) / (conditions_ap.shape[0] * conditions_ap.shape[1])

0.0017294036319856368
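Calling `np.sum` directly on the h5py dataset most likely reads the whole matrix into memory first. For larger containers the same density can be accumulated a block of rows at a time. The following is a minimal sketch under stated assumptions: `density_in_blocks` and `block_rows` are hypothetical names, and a small NumPy array stands in for the dataset (an h5py dataset supports the same `shape` attribute and row slicing). It counts nonzero cells, which matches the sum only because the matrix holds `0`/`1` values.

```python
import numpy as np

def density_in_blocks(array_like, block_rows=1000):
    """Fraction of nonzero cells, reading block_rows rows at a time.

    Hypothetical helper: array_like only needs .shape and row slicing,
    which both NumPy arrays and h5py datasets provide.
    """
    n_rows, n_cols = array_like.shape
    nonzero_cells = 0
    for start in range(0, n_rows, block_rows):
        # Only this slice is materialized in memory
        block = array_like[start:start + block_rows, :]
        nonzero_cells += np.count_nonzero(block)
    return nonzero_cells / float(n_rows * n_cols)

# Toy 10 x 8 indicator matrix with four cells set to 1
toy = np.zeros((10, 8))
toy[0, 1] = 1.0
toy[3, 5] = 1.0
toy[9, 0] = 1.0
toy[9, 7] = 1.0

toy_density = density_in_blocks(toy, block_rows=4)
print(toy_density)
```

The block size trades memory for the number of reads; the result does not depend on it.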

In [12]:
f5["/ohdsi/condition_occurrence/column_annotations"][...]

array([['condition_concept', 'condition_concept', 'condition_concept', ...,
        'condition_concept', 'condition_concept', 'condition_concept'],
       ['0', '132344', '132392', ..., '81931', '81942', '81989'],
       ['No matching concept', 'Gingival and periodontal disease',
        'Staphylococcal scalded skin syndrome', ...,
        'Psoriasis with arthropathy',
        'Enthesopathy of ankle AND/OR tarsus',
        'Open wound of upper arm with complication'],
       ['categorical_list', 'categorical_list', 'categorical_list', ...,
        'categorical_list', 'categorical_list', 'categorical_list']],
      dtype='|S128')

Each `core_array` is paired with a `column_annotations` array, which provides human-readable labels and codes for each column.
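The pairing can be used to look up a column by its label and then pull that column out of the matrix. The sketch below is illustrative only: the toy arrays copy a few values from the output above, `column_index_for_label` is a hypothetical helper, and the reading of the rows (field name, concept code, label, data type) is inferred from the printed output rather than from a documented specification. Labels are compared as bytes, which is how fixed-width `S128` strings appear under Python 3.

```python
import numpy as np

# Toy stand-in for /ohdsi/condition_occurrence/column_annotations
toy_annotations = np.array([
    [b"condition_concept", b"condition_concept", b"condition_concept"],
    [b"0", b"132344", b"132392"],
    [b"No matching concept",
     b"Gingival and periodontal disease",
     b"Staphylococcal scalded skin syndrome"],
    [b"categorical_list", b"categorical_list", b"categorical_list"],
], dtype="S128")

# Toy stand-in for the paired core_array: 3 hospitalizations x 3 conditions
toy_core = np.array([[0.0, 1.0, 0.0],
                     [1.0, 1.0, 0.0],
                     [0.0, 0.0, 0.0]])

def column_index_for_label(annotations, label):
    """Return the first column whose label (assumed to be row 2) equals label."""
    matches = np.nonzero(annotations[2, :] == label)[0]
    if matches.size == 0:
        raise KeyError(label)
    return int(matches[0])

idx = column_index_for_label(toy_annotations, b"Gingival and periodontal disease")
visits_with_condition = int(toy_core[:, idx].sum())
print(idx, visits_with_condition)
```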

In [13]:
conditions = f5["/ohdsi/condition_occurrence/column_annotations"][...]

In [14]:
summed_conditions = np.sum(conditions_ap, 0)

In [15]:
conditions[2, :]

array(['No matching concept', 'Gingival and periodontal disease',
       'Staphylococcal scalded skin syndrome', ...,
       'Psoriasis with arthropathy', 'Enthesopathy of ankle AND/OR tarsus',
       'Open wound of upper arm with complication'],
      dtype='|S128')

In [16]:
np.lexsort((-1 * summed_conditions,))

array([ 796, 1843, 1121, ..., 3540, 3549, 3558], dtype=int64)

In [17]:
conditions_sort_order = np.lexsort((-1 * summed_conditions,))

In [18]:
conditions_sort_order.shape[0]

3559L

In [19]:
condition_names = conditions[2,:]

In [20]:
condition_names[conditions_sort_order]

array(['Type 2 diabetes mellitus',
       'Coronary arteriosclerosis in native artery',
       'Congestive heart failure', ...,
       'Closed traumatic dislocation of knee joint',
       'Traumatic spondylopathy',
       'Open wound of upper arm with complication'],
      dtype='|S128')

In [21]:
summed_conditions[conditions_sort_order]

array([  1.42200000e+04,   1.13600000e+04,   1.11370000e+04, ...,
         1.00000000e+00,   1.00000000e+00,   1.00000000e+00])

In [22]:
summed_conditions[conditions_sort_order] / conditions_ap.shape[0]

array([  2.13193403e-01,   1.70314843e-01,   1.66971514e-01, ...,
         1.49925037e-05,   1.49925037e-05,   1.49925037e-05])

In [23]:
fraction_conditions_sorted = summed_conditions[conditions_sort_order]  / conditions_ap.shape[0]
conditions_names_sorted = condition_names[conditions_sort_order]
paired_conditions_sorted = [(conditions_names_sorted[i], fraction_conditions_sorted[i]) 
                            for i in range(fraction_conditions_sorted.shape[0])]
paired_conditions_sorted[0:20]

[('Type 2 diabetes mellitus', 0.21319340329835082),
 ('Coronary arteriosclerosis in native artery', 0.17031484257871066),
 ('Congestive heart failure', 0.16697151424287857),
 ('Atrial fibrillation', 0.16640179910044978),
 ('Urinary tract infectious disease', 0.14211394302848576),
 ('Gastroesophageal reflux disease', 0.13380809595202398),
 ('Hypertensive renal disease with renal failure', 0.12647676161919041),
 ('Hypothyroidism', 0.11437781109445277),
 ('Anemia', 0.091154422788605693),
 ('Coronary arteriosclerosis', 0.079055472263868068),
 ('Dehydration', 0.073448275862068965),
 ('No matching concept', 0.072983508245877057),
 ('Tobacco dependence syndrome', 0.070704647676161914),
 ('Pure hypercholesterolemia', 0.067901049475262362),
 ('Sepsis', 0.06697151424287856),
 ('Hypokalemia', 0.065847076461769113),
 ('Acute exacerbation of chronic obstructive airways disease',
  0.056596701649175414),
 ('Osteoarthritis', 0.055772113943028487),
 ('Organic mental disorder', 0.055607196401799099),
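
The ranking built in the cells above can also be written with `np.argsort`. With a single key, `np.lexsort` behaves like a stable sort on that key, so a stable `argsort` of the negated counts should give the same order. This is a minimal sketch on made-up data; `top_k` and the toy names and counts are hypothetical and not taken from the container.

```python
import numpy as np

# Made-up labels and per-column counts standing in for
# condition_names and summed_conditions
toy_names = np.array(["A", "B", "C", "D"])
toy_counts = np.array([3.0, 10.0, 1.0, 7.0])
n_visits = 20  # stands in for conditions_ap.shape[0]

def top_k(names, counts, n_rows, k):
    """Return the k most frequent (name, fraction of rows) pairs."""
    # Negate so the largest counts sort first; stable keeps ties in column order
    order = np.argsort(-counts, kind="stable")[:k]
    return [(str(names[i]), float(counts[i]) / n_rows) for i in order]

ranked = top_k(toy_names, toy_counts, n_visits, k=2)
print(ranked)
```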
 

In [24]:
procedures_ap = f5["/ohdsi/procedure_occurrence/core_array"]
procedures = f5["/ohdsi/procedure_occurrence/column_annotations"][...]

In [25]:
procedure_names = procedures[2,:]
procedure_names

array(['Infusion of drotrecogin alfa (activated)',
       'Injection or infusion of nesiritide',
       'Injection or infusion of oxazolidinone class of antibiotics', ...,
       'Attention to tracheostomy', 'Attention to catheter',
       'Implantation of epiretinal visual prosthesis'],
      dtype='|S128')

In [26]:
summed_procedures = np.sum(procedures_ap,0)
summed_procedures

array([  2.,  18.,   9., ...,  65.,   1.,   8.])

In [27]:
procedures_sort_order = np.lexsort((summed_procedures * -1, ))
fraction_procedures_sorted = summed_procedures[procedures_sort_order] / procedures_ap.shape[0]
procedure_names_sorted = procedure_names[procedures_sort_order]
paired_procedures_sorted = [(procedure_names_sorted[i],fraction_procedures_sorted[i])
                            for i in range(fraction_procedures_sorted.shape[0])]
paired_procedures_sorted[0:20]

[('Other diagnostic procedures on lymphatic structures', 0.38083958020989506),
 ('Biopsy of mouth, unspecified structure', 0.19587706146926537),
 ('Other repair of urethra', 0.10338830584707646),
 ('Excision of anus', 0.10263868065967016),
 ('Other resection of rectum', 0.096596701649175415),
 ('Suture of laceration of palate', 0.069070464767616191),
 ('Temporary tracheostomy', 0.056806596701649177),
 ('Splenotomy', 0.043538230884557723),
 ('Other intrathoracic esophagoenterostomy', 0.039925037481259369),
 ('Long-term drug therapy', 0.039085457271364317),
 ('Other gastroenterostomy without gastrectomy', 0.036041979010494753),
 ('Correction of cleft palate', 0.030839580209895051),
 ('Other diagnostic procedures on biliary tract', 0.030779610194902548),
 ('Transfusion of packed cells', 0.027901049475262368),
 ('Rehabilitation therapy', 0.027511244377811096),
 ('Total knee replacement', 0.024677661169415291),
 ('Bone graft, humerus', 0.024122938530734633),
 ('Esophagectomy, not otherwise 

In [28]:
f5["/ohdsi/identifiers/column_annotations"][...]

array([['person_id'],
       [''],
       [''],
       ['integer']],
      dtype='|S128')

In [29]:
np.unique(f5["/ohdsi/identifiers/core_array"][...]).shape[0]

37780L

In [30]:
f5["/ohdsi/identifiers/core_array"][...]

array([[  1.00000000e+00],
       [  2.00000000e+00],
       [  2.00000000e+00],
       ..., 
       [  1.16343000e+05],
       [  1.16346000e+05],
       [  1.16349000e+05]])

In [31]:
f5["/ohdsi/person/column_annotations"][...]

array([['gender_concept_name', 'gender_concept_name', 'race_concept_name',
        'race_concept_name', 'race_concept_name', 'ethnicity_concept_name',
        'ethnicity_concept_name', 'birth_julian_day', 'birth_date'],
       ['FEMALE', 'MALE', 'Black or African American',
        'No matching concept', 'White', 'Hispanic or Latino',
        'Not Hispanic or Latino', '', ''],
       ['', '', '', '', '', '', '', '', ''],
       ['categorical', 'categorical', 'categorical', 'categorical',
        'categorical', 'categorical', 'categorical', 'integer', 'datetime']],
      dtype='|S128')

In [32]:
persons = f5["/ohdsi/person/column_annotations"][...]
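# index() returns the first matching column, which is the FEMALE indicator
gender_index = persons[0,:].tolist().index("gender_concept_name")
persons_af = f5["/ohdsi/person/core_array"]
gender_array = persons_af[:,gender_index]
# Rows are hospitalizations, so this counts hospitalizations of female patients
np.sum(gender_array)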

37718.0

In [33]:
f5["/ohdsi/visit_occurrence/column_annotations"][...]

array([['visit_concept_name', 'visit_type_concept_name',
        'age_at_visit_start_in_years_int', 'age_at_visit_start_in_days',
        'visit_start_julian_day', 'visit_end_julian_day',
        'visit_start_datetime', 'visit_end_datetime'],
       ['Inpatient Visit', 'Visit derived from encounter on claim', '', '',
        '', '', '', ''],
       ['', '', '', '', '', '', '', ''],
       ['categorical', 'categorical', 'integer', 'integer', 'integer',
        'integer', 'datetime', 'datetime']],
      dtype='|S128')

In [34]:
visit_occurrences = f5["/ohdsi/visit_occurrence/column_annotations"][...]
age_years_index = visit_occurrences[0,:].tolist().index("age_at_visit_start_in_years_int")
# Use a separate name so the person matrix handle is not overwritten
visits_af = f5["/ohdsi/visit_occurrence/core_array"]
age_in_years_array = visits_af[:,age_years_index]
np.unique(np.array(age_in_years_array, dtype="int64"))

array([ 24,  25,  26,  27,  28,  29,  30,  31,  32,  33,  34,  35,  36,
        37,  38,  39,  40,  41,  42,  43,  44,  45,  46,  47,  48,  49,
        50,  51,  52,  53,  54,  55,  56,  57,  58,  59,  60,  61,  62,
        63,  64,  65,  66,  67,  68,  69,  70,  71,  72,  73,  74,  75,
        76,  77,  78,  79,  80,  81,  82,  83,  84,  85,  86,  87,  88,
        89,  90,  91,  92,  93,  94,  95,  96,  97,  98,  99, 100, 101], dtype=int64)

In [35]:
# Bin edges stop at 99, so ages 100 and 101 are not counted in this histogram
np.histogram(age_in_years_array,bins=np.arange(0,100))

(array([   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,   12,   52,   83,  108,   93,   80,   94,   85,  109,
         106,  123,  125,  153,  142,  145,  178,  197,  157,  174,  168,
         258,  312,  367,  330,  312,  343,  356,  351,  397,  344,  390,
         446,  579,  539,  548,  522,  471,  534,  555,  525,  771, 1426,
        2124, 2363, 2325, 2247, 2277, 2269, 2208, 2221, 2289, 2322, 2174,
        2167, 2186, 2247, 2187, 2035, 1882, 1959, 1846, 1685, 1475, 1461,
        1394, 1250,  802,  490,  457,  500,  474,  461,  423,  462,  799], dtype=int64),
 array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
        17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
        34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50,
        51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67,
        68, 69,

In [36]:
f5["/computed/next/30_days/visit_occurrence/column_annotations"][...]

array([['visit_concept_name'],
       ['Inpatient Visit'],
       [''],
       ['categorical']],
      dtype='|S128')

In [37]:
np.sum(f5["/computed/next/30_days/visit_occurrence/core_array"])

6421.0

In [38]:
np.sum(f5["/computed/next/30_days/visit_occurrence/core_array"])/\
f5["/computed/next/30_days/visit_occurrence/core_array"].shape[0]

0.096266866566716638

In [39]:
readmissions30_ap = f5["/computed/next/30_days/visit_occurrence/core_array"]
readmissions30_array = readmissions30_ap[:]

In [44]:
domains_of_interest = ["person",
"visit_occurrence",
"condition_occurrence",
"procedure_occurrence",
"measurement/count",
"observation/count"]

domain_dict = {d : "/ohdsi/" + d + "/core_array" for d in domains_of_interest }
domain_dict

{'condition_occurrence': '/ohdsi/condition_occurrence/core_array',
 'measurement/count': '/ohdsi/measurement/count/core_array',
 'observation/count': '/ohdsi/observation/count/core_array',
 'person': '/ohdsi/person/core_array',
 'procedure_occurrence': '/ohdsi/procedure_occurrence/core_array',
 'visit_occurrence': '/ohdsi/visit_occurrence/core_array'}

In [49]:
for domain in domain_dict:
    ca = f5[domain_dict[domain]][...]
    nonzero_ca = np.where(ca > 0)
    print(domain, ca.shape, nonzero_ca[0].shape)

('person', (66700L, 9L), (268370L,))
('condition_occurrence', (66700L, 3559L), (410535L,))
('visit_occurrence', (66700L, 8L), (533600L,))
('procedure_occurrence', (66700L, 1888L), (177437L,))
('observation/count', (66700L, 194L), (32652L,))
('measurement/count', (66700L, 41L), (5551L,))
