CMS (Centers for Medicare & Medicaid Services) has made available a synthetic PUF (Public Use File) of Medicare claims. This data set is known as [SYNPUF](https://www.cms.gov/Research-Statistics-Data-and-Systems/Downloadable-Public-Use-Files/SynPUFs/). It contains synthetically generated diagnoses, demographics, and procedures for visits. The data is stored in a terse format that is difficult to work with. It can be transformed with an [ETL script](https://github.com/OHDSI/ETL-CMS/blob/master/python_etl/README.md) into [OHDSI](https://www.ohdsi.org/)'s [Common Data Model](https://github.com/OHDSI/CommonDataModel). The transformed SYNPUF data (version 5) was loaded into a PostgreSQL database server. Using a [mapper script](https://github.com/jhajagos/TransformDBtoHDF5ML), inpatient visits are mapped into separate matrices.

The goal is to build a predictive model of 30-day inpatient readmission. To build such a model, the data needs to be processed further. This part of the tutorial explores basic analysis of the synthetic clinical HDF5 container file using [h5py](http://www.h5py.org/) and [NumPy](http://www.numpy.org/).

In [1]:
import h5py # Library for reading HDF5 files

In [2]:
import numpy as np # Numerical matrix library

In [3]:
f5 = h5py.File("synpuf_inpatient_combined.hdf5", "r")

The `f5` object controls access to the underlying data in the HDF5 container. Matrix containers are accessed using a notation similar to traversing a file system. Here the container is opened in read-only mode.

In [5]:
list(f5["/"])

[u'computed', u'ohdsi']

In [8]:
list(f5["/ohdsi/"])

[u'condition_occurrence',
 u'death',
 u'drug_exposure',
 u'identifiers',
 u'measurement',
 u'observation',
 u'observation_period',
 u'person',
 u'procedure_occurrence',
 u'visit_occurrence']

In [6]:
list(f5["/ohdsi/condition_occurrence/"])

[u'column_annotations', u'column_header', u'core_array']
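The group-by-group listings above can also be collected in a single pass. The following is a minimal sketch, assuming h5py's `visititems` method; the helper `list_datasets` and the small in-memory file `demo_layout.hdf5` are hypothetical stand-ins that only mimic the layout of the tutorial's container.

```python
import h5py
import numpy as np


def list_datasets(h5_file):
    """Return the full path of every dataset in an open HDF5 file."""
    paths = []

    def collect(name, obj):
        # visititems calls this for every group and dataset below the root
        if isinstance(obj, h5py.Dataset):
            paths.append("/" + name)

    h5_file.visititems(collect)
    return sorted(paths)


# Hypothetical in-memory file (nothing is written to disk)
demo = h5py.File("demo_layout.hdf5", "w", driver="core", backing_store=False)
demo.create_dataset("ohdsi/person/core_array", data=np.zeros((4, 2)))
demo.create_dataset("ohdsi/death/core_array", data=np.zeros((4, 1)))

dataset_paths = list_datasets(demo)
print(dataset_paths)
demo.close()
```

Called on the `f5` object, the same helper should list every `core_array`, `column_header`, and `column_annotations` path without stepping through each group by hand.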

The object stored at `/ohdsi/condition_occurrence/core_array` is a numeric array. The h5py library emulates the API of NumPy. 

In [7]:
f5["/ohdsi/condition_occurrence/core_array"].shape

(66700, 3559)
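To illustrate what emulating the NumPy API means in practice, here is a small sketch on a hypothetical in-memory dataset rather than the tutorial's file. Reading `shape` touches only metadata, while slicing reads just the requested rows and returns an ordinary NumPy array.

```python
import h5py
import numpy as np

# Hypothetical in-memory file holding a 5 x 5 identity matrix
demo = h5py.File("demo_api.hdf5", "w", driver="core", backing_store=False)
dset = demo.create_dataset("core_array", data=np.eye(5))

dataset_shape = dset.shape   # metadata lookup, no array data is read
first_rows = dset[0:2, :]    # reads only the first two rows into memory

is_ndarray = isinstance(first_rows, np.ndarray)
row_total = float(first_rows.sum())
print(dataset_shape, is_ndarray, row_total)
demo.close()
```

Because the slice is a plain NumPy array, it remains usable after the file is closed.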

Let's start by looking at the data stored in the matrix. Each row represents a separate hospitalization and each column a separate condition. A `1` indicates that the condition is present.

In [9]:
f5["/ohdsi/condition_occurrence/core_array"][0:10,:]

array([[ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       ..., 
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.]])

In [10]:
np.sum(f5["/ohdsi/condition_occurrence/core_array"][0:10,:])

49.0

The array is sparse: it consists mostly of `0` entries. Sparse encoded arrays like this are common in health care data.

In [19]:
conditions_pointer = f5["/ohdsi/condition_occurrence/core_array"]  # handle to the on-disk matrix
np.sum(conditions_pointer) / (conditions_pointer.shape[0] * conditions_pointer.shape[1])

0.0017294036319856368
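The fraction of nonzero entries can also be computed one block of rows at a time, so the full matrix never has to sit in memory at once. This is a sketch, not the tutorial's method: the helper `density` and its `block_rows` parameter are hypothetical, and it assumes a 0/1 indicator matrix, so that counting nonzero entries equals summing them. It is shown on a toy NumPy array, and it should work on any object that offers `shape` and row slicing.

```python
import numpy as np


def density(matrix, block_rows=1000):
    """Fraction of nonzero entries, computed block by block over the rows."""
    n_rows, n_cols = matrix.shape
    nonzero = 0
    for start in range(0, n_rows, block_rows):
        block = matrix[start:start + block_rows, :]  # only this block is loaded
        nonzero += int(np.count_nonzero(block))
    return nonzero / (n_rows * n_cols)


# Toy indicator matrix: 4 ones among 80 cells
toy = np.zeros((10, 8))
toy[0, 1] = 1
toy[3, 5] = 1
toy[9, 0] = 1
toy[9, 7] = 1

toy_density = density(toy, block_rows=4)
print(toy_density)
```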

In [12]:
f5["/ohdsi/condition_occurrence/column_annotations"][...]

array([['condition_concept', 'condition_concept', 'condition_concept', ...,
        'condition_concept', 'condition_concept', 'condition_concept'],
       ['0', '132344', '132392', ..., '81931', '81942', '81989'],
       ['No matching concept', 'Gingival and periodontal disease',
        'Staphylococcal scalded skin syndrome', ...,
        'Psoriasis with arthropathy',
        'Enthesopathy of ankle AND/OR tarsus',
        'Open wound of upper arm with complication'],
       ['categorical_list', 'categorical_list', 'categorical_list', ...,
        'categorical_list', 'categorical_list', 'categorical_list']], 
      dtype='|S128')

Each `core_array` is paired with a `column_annotations` array, which provides human-readable labels and codes for the columns.
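As an illustration of how the pairing can be used, the sketch below looks up the code and label of every condition flagged in one row. The helper `labels_for_row` is hypothetical. The toy annotation array copies three columns from the output above, and the assumption that row 1 holds the concept code and row 2 the label is inferred from that output.

```python
import numpy as np


def to_text(value):
    """Decode fixed-length byte strings; pass text through unchanged."""
    return value.decode("utf-8") if isinstance(value, bytes) else str(value)


def labels_for_row(row, annotations):
    """Return (code, label) pairs for the nonzero columns of one row."""
    present = np.flatnonzero(row)  # column indices where the flag is set
    return [(to_text(annotations[1, j]), to_text(annotations[2, j]))
            for j in present]


# Toy stand-in for column_annotations, same four-row layout
toy_annotations = np.array([
    [b"condition_concept"] * 3,
    [b"0", b"132344", b"132392"],
    [b"No matching concept",
     b"Gingival and periodontal disease",
     b"Staphylococcal scalded skin syndrome"],
    [b"categorical_list"] * 3,
], dtype="S128")

toy_row = np.array([0.0, 1.0, 1.0])
found = labels_for_row(toy_row, toy_annotations)
print(found)
```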

In [15]:
conditions = f5["/ohdsi/condition_occurrence/column_annotations"][...]
conditions_pointer = f5["/ohdsi/condition_occurrence/core_array"]

In [26]:
random_rows = np.random.randint(0, conditions_pointer.shape[0], 1000)  # 1000 random row indices, unsorted and possibly repeated

TypeError: PointSelection __getitem__ only works with bool arrays
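The cell that raised this error is not shown, so its cause is an assumption. The error is consistent with h5py's restriction that an index list used to select rows must be in increasing order and free of duplicates, which `np.random.randint` does not guarantee. One possible workaround is to draw the indices without replacement and sort them before indexing. The helper `sample_rows` below is hypothetical and is demonstrated on a NumPy array standing in for the dataset.

```python
import numpy as np


def sample_rows(dataset, n_samples, seed=0):
    """Pick n_samples distinct rows, using sorted indices for the selection."""
    rng = np.random.default_rng(seed)
    # replace=False avoids duplicates; sorting gives increasing order
    idx = np.sort(rng.choice(dataset.shape[0], size=n_samples, replace=False))
    return idx, dataset[idx, :]


# NumPy stand-in for the on-disk matrix
toy = np.arange(40).reshape(10, 4)
idx, rows = sample_rows(toy, 3)
print(idx)
print(rows)
```

With the tutorial's data, the call would look like `sample_rows(conditions_pointer, 1000)`. Sorting changes only the order in which the sampled rows are returned, not which rows are drawn.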