CMS (Centers for Medicare & Medicaid Services) has made available a synthetic PUF (Public Use File) of Medicare claims. This data set is known as [SYNPUF](https://www.cms.gov/Research-Statistics-Data-and-Systems/Downloadable-Public-Use-Files/SynPUFs/). It contains synthetically generated diagnoses, demographics, and procedures. A robust [ETL script](https://github.com/OHDSI/ETL-CMS/blob/master/python_etl/README.md) has been developed to map the data into [OHDSI](https://www.ohdsi.org/)'s [Common Data Model](https://github.com/OHDSI/CommonDataModel). The SYNPUF data was loaded into a PostgreSQL database server using version 5.0 of the Common Data Model. Using a [mapper script](https://github.com/jhajagos/TransformDBtoHDF5ML), the data was mapped to matrices in an HDF5 container. Only inpatient data was included in the mapped matrices. There are many possible mappings of data in the OHDSI Common Data Model, and there is no single official standardized mapping.

The end result is an HDF5 file that can be used to teach how to analyze hospital inpatient data and to demonstrate the process of building a predictive 30-day inpatient readmission model. This part of the tutorial explores basic analysis of the synthetic clinical HDF5 container file using NumPy.

In [1]:
import h5py # Library for reading HDF5

In [2]:
import numpy as np # Numerical matrix library

In [3]:
f5 = h5py.File("synpuf_inpatient_combined.hdf5", "r")

The `f5` object is your entry point for exploring the underlying data in an HDF5 container. Matrices are accessed using a notation similar to traversing a file system.

In [5]:
list(f5["/"])

[u'computed', u'ohdsi']

In [8]:
list(f5["/ohdsi/"])

[u'condition_occurrence',
 u'death',
 u'drug_exposure',
 u'identifiers',
 u'measurement',
 u'observation',
 u'observation_period',
 u'person',
 u'procedure_occurrence',
 u'visit_occurrence']

In [6]:
list(f5["/ohdsi/condition_occurrence/"])

[u'column_annotations', u'column_header', u'core_array']

In [7]:
f5["/ohdsi/condition_occurrence/core_array"].shape

(66700, 3559)
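
The group-by-group listing above can also be done in a single pass. The sketch below is one possible way to do it with h5py's `visititems`, which calls a function for every group and dataset under a starting point. The helper name `describe_datasets` and the in-memory file `toy_layout.hdf5` are hypothetical; the toy container only imitates the layout seen above and is not the real SYNPUF file.

```python
import h5py
import numpy as np


def describe_datasets(h5_group):
    """Return a {path: shape} dictionary for every dataset beneath h5_group."""
    shapes = {}

    def record(name, obj):
        # visititems passes the path relative to h5_group and the object itself
        if isinstance(obj, h5py.Dataset):
            shapes["/" + name] = obj.shape

    h5_group.visititems(record)
    return shapes


# Toy container held in memory only (assumption: stands in for the real file)
toy = h5py.File("toy_layout.hdf5", "w", driver="core", backing_store=False)
toy.create_dataset("/ohdsi/condition_occurrence/core_array",
                   data=np.zeros((4, 3)))
toy.create_dataset("/ohdsi/condition_occurrence/column_annotations",
                   data=np.array([[b"a", b"b", b"c"]]))

toy_shapes = describe_datasets(toy)
for path in sorted(toy_shapes):
    print(path, toy_shapes[path])

toy.close()
```

Calling `describe_datasets(f5)` on the tutorial file should, under the same assumptions, list every matrix together with its shape.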

Let's start by looking at the data stored in the matrix.

In [9]:
f5["/ohdsi/condition_occurrence/core_array"][0:10,:]

array([[ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       ..., 
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.]])

In [10]:
np.sum(f5["/ohdsi/condition_occurrence/core_array"][0:10,:])

49.0

In [11]:
np.where(f5["/ohdsi/condition_occurrence/core_array"][0:10,:] > 0)

(array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4,
        5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7, 8, 8, 8, 8, 8, 8, 9, 9,
        9, 9, 9], dtype=int64),
 array([ 322, 1121, 2486, 2522, 3502,  859, 1961, 2524, 3073, 3485, 1228,
        1736, 3319, 1396, 2233, 2920, 3005, 3222,  322, 1281, 1629, 1996,
        2503,  992, 1005, 2029, 3069,  464, 1004, 1005, 1229, 2015, 3067,
         266,  796, 1004, 1080, 1927,  796,  885,  889, 1121, 1542, 2461,
        1111, 1591, 1738, 2522, 3453], dtype=int64))
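
`np.where` returns two parallel arrays: the row index and the column index of each matching cell. Here there are 49 index pairs, which equals the sum of 49.0 computed earlier and suggests that each nonzero cell in these ten rows holds a 1. The sketch below shows one way to regroup the pairs so that each row lists its own nonzero columns. It runs on a small made-up matrix, and the names `toy_matrix` and `columns_by_row` are hypothetical.

```python
import numpy as np

# Made-up indicator matrix: 3 rows (visits) by 4 columns (coded conditions)
toy_matrix = np.array([[0., 1., 0., 1.],
                       [0., 0., 0., 0.],
                       [1., 0., 0., 0.]])

# Parallel arrays of row and column positions where the value is positive
rows, cols = np.where(toy_matrix > 0)

# Collect the column positions under their row; rows with no hits are absent
columns_by_row = {}
for r, c in zip(rows, cols):
    columns_by_row.setdefault(int(r), []).append(int(c))

print(columns_by_row)
```

To apply the same idea to the tutorial file, read the slice into a NumPy array once, for example `first_rows = f5["/ohdsi/condition_occurrence/core_array"][0:10, :]`, and pass `first_rows` in place of `toy_matrix`.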

In [12]:
f5["/ohdsi/condition_occurrence/column_annotations"][...]

array([['condition_concept', 'condition_concept', 'condition_concept', ...,
        'condition_concept', 'condition_concept', 'condition_concept'],
       ['0', '132344', '132392', ..., '81931', '81942', '81989'],
       ['No matching concept', 'Gingival and periodontal disease',
        'Staphylococcal scalded skin syndrome', ...,
        'Psoriasis with arthropathy',
        'Enthesopathy of ankle AND/OR tarsus',
        'Open wound of upper arm with complication'],
       ['categorical_list', 'categorical_list', 'categorical_list', ...,
        'categorical_list', 'categorical_list', 'categorical_list']], 
      dtype='|S128')
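
The annotations array has one column per column of `core_array`, so a column index can be translated into a readable label. The sketch below assumes, based only on the output above, that row 1 holds the concept identifier and row 2 holds the concept name. The helper `describe_columns` and the array `toy_annotations` are hypothetical; the toy array reuses the first three columns printed above.

```python
import numpy as np

# Toy copy of the first three annotation columns shown in the output above
toy_annotations = np.array([
    ["condition_concept"] * 3,
    ["0", "132344", "132392"],
    ["No matching concept",
     "Gingival and periodontal disease",
     "Staphylococcal scalded skin syndrome"],
    ["categorical_list"] * 3,
], dtype="S128")


def describe_columns(annotations, column_indices):
    """Return (concept_id, concept_name) pairs for the given column indices."""
    labels = []
    for col in column_indices:
        # Entries are fixed-width byte strings, so decode them to text
        concept_id = annotations[1, col].decode("utf-8")      # assumed: id row
        concept_name = annotations[2, col].decode("utf-8")    # assumed: name row
        labels.append((concept_id, concept_name))
    return labels


toy_labels = describe_columns(toy_annotations, [1, 2])
for concept_id, concept_name in toy_labels:
    print(concept_id, concept_name)
```

With the real file, the column indices returned by `np.where` could be passed as `column_indices` together with `f5["/ohdsi/condition_occurrence/column_annotations"][...]`, provided the row layout matches the assumption stated here.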