# Hippocratech Healthcare Data

_April 15th, 2019_

Quarter 4 datasets are available for 2012 through 2018. We cleaned and combined the datasets from 2014 to 2017.

## Table of Contents

<div class='alert alert-block alert-info' style='margin-top: 20px'>
    <li><a href='#ref1'>1. Data Preprocessing</a>
    <li><a href='#ref2'>2. </a>
</div>

<a id='ref1'></a>
## 1. Data Preprocessing

In [2]:
import pandas as pd
import zipfile

In [56]:
def read_data(year, zipname, csvname):
    """Read one year's Q4 provider file from its zip archive into a DataFrame."""
    with zipfile.ZipFile('./data/%s' % zipname) as zf:
        if year in [2017, 2016]:
            # 2016-2017 files use the standardized '*_std' column names
            data = pd.read_csv(zf.open(csvname),
                               usecols=['SiteName_std', 'City_std', 'State_std', 'Zip_std',
                                        'TrueCountyName', 'desser'],  # desser - Designated Service
                               dtype={'Zip_std': object})  # keep ZIP codes as strings
            # Rename to the short column names used in the 2014-2015 files
            data.rename(columns={'SiteName_std': 'sitename', 'City_std': 'city', 'State_std': 'state',
                                 'Zip_std': 'zip', 'TrueCountyName': 'county'}, inplace=True)
            # inplace=True modifies data directly instead of returning a new DataFrame
            data.rename(str.lower, axis='columns', inplace=True)

        elif year in [2015, 2014]:
            data = pd.read_csv(zf.open(csvname),
                               usecols=['sitename', 'city', 'state', 'zip',
                                        'county', 'desser'],  # desser - Designated Service
                               dtype={'zip': object})  # keep ZIP codes as strings

        else:
            raise ValueError('Unsupported year: %s' % year)

    data['year'] = year

    return data
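
The reading pattern used in `read_data` (open a CSV member of a zip archive, load only selected columns, and force the ZIP code column to text) can be tried without the real data. The following is a minimal sketch that builds a tiny archive in memory; the file name `toy.csv`, the `extra` column, and the row values are hypothetical and are not taken from the actual datasets.

```python
import io
import zipfile

import pandas as pd

# Build a tiny zip archive in memory (hypothetical contents, for illustration only).
csv_text = (
    "sitename,city,state,zip,county,desser,extra\n"
    "Clinic A,Holtsville,NY,00501,103,101,x\n"
)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("toy.csv", csv_text)
buffer.seek(0)

# Read only the columns of interest straight from the archive member.
with zipfile.ZipFile(buffer) as zf:
    with zf.open("toy.csv") as member:
        toy = pd.read_csv(
            member,
            usecols=["sitename", "city", "state", "zip", "county", "desser"],
            dtype={"zip": object},  # keep ZIP codes as strings
        )

print(toy.dtypes)
print(toy.loc[0, "zip"])
```

Because `zip` is read as `object`, the value stays a string and its leading zeros survive; the unlisted `extra` column is never loaded.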

The counties of New York City and their county codes are:

005 - Bronx

047 - Kings (Brooklyn)

061 - New York (Manhattan)

081 - Queens

085 - Richmond (Staten Island)
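
The code-to-name correspondence listed above can be expressed as a dictionary lookup. This is a minimal sketch on hypothetical rows; it uses `Series.map` rather than the `replace` call found in the notebook's own cleaning function, and the names `NYC_COUNTY_CODES`, `toy`, and `county_name` are assumptions made for illustration.

```python
import pandas as pd

# County codes and names taken from the list above.
NYC_COUNTY_CODES = {5: "Bronx", 47: "Kings", 61: "New York", 81: "Queens", 85: "Richmond"}

# Hypothetical rows: three NYC county codes plus one code outside the list (1).
toy = pd.DataFrame({"sitename": ["A", "B", "C", "D"], "county": [5, 47, 1, 85]})

# Map codes to names; codes missing from the dictionary become NaN.
toy["county_name"] = toy["county"].map(NYC_COUNTY_CODES)

# Keep only rows whose county resolved to one of the five NYC names.
toy_nyc = toy[toy["county_name"].isin(list(NYC_COUNTY_CODES.values()))]
print(toy_nyc)
```

With `map`, unknown codes turn into `NaN` and fall out of the filter, whereas `replace` would leave them unchanged; either approach yields the same NYC subset once the `isin` filter is applied.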

In [57]:
def clean_data(year, data):
    """Keep New York City rows and drop pharmacies (desser code 760)."""
    # 1. New York State
    data_ny = data[data.state == 'NY'].copy()

    # 2. New York City
    if year in [2015, 2014]:
        # For 2014-2015, map numeric county codes to county names
        data_ny['county'] = data_ny['county'].replace({5: 'Bronx', 47: 'Kings', 61: 'New York',
                                                       81: 'Queens', 85: 'Richmond'})

    nyc_counties = ['Bronx', 'Kings', 'New York', 'Queens', 'Richmond']
    data_nyc = data_ny[data_ny.county.isin(nyc_counties)]

    # 3. Check services and remove pharmacy (desser code 760)
    data_noPharma = data_nyc[data_nyc.desser != 760]

    return data_noPharma

In [63]:
zip_and_csv = [[2017, 
                'NYSDOH_PNDS_InstitutionalProviderData_2017Q4.zip', 
                'PNDS_Institutional_Q417.csv'],
               [2016, 
                'NYSDOH_PNDS_InstitutionalProviderData_2016_Q04.zip', 
                'PNDS_Institutional_Q416_v2.csv'],
               [2015, 
                'NYSDOH_PNDS_InstitutionalProviderData_2015_Q04.zip', 
                'NYSDOH_PNDS_InstitutionalProviderData_2015_Q04.csv'],
               [2014, 
                'NYSDOH_PNDS_InstitutionalProviderData_2014Q4.zip',
                'NYSDOH_PNDS_InstitutionalProviderData_2014Q4.csv']]

In [86]:
%%time
# Combine datasets from 2014 to 2017
# (%%time is a cell magic, so it has to be the first line of the cell)
appended_data = []
for year, zipname, csvname in zip_and_csv:
    readdata = read_data(year=year, zipname=zipname, csvname=csvname)
    cleandata = clean_data(year=year, data=readdata)
    appended_data.append(cleandata)
    
data14to17 = pd.concat(appended_data, sort=True)

# Remove duplicates
data14to17_noDup = data14to17.drop_duplicates()

CPU times: user 12.9 s, sys: 1.05 s, total: 14 s
Wall time: 12.9 s
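
Since every row carries its `year`, calling `drop_duplicates()` with no arguments only removes rows that are identical within the same year; a site that appears in several years is kept once per year. A minimal sketch with hypothetical rows (the frames `y2014` and `y2015` and their values are made up for illustration):

```python
import pandas as pd

# Hypothetical per-year frames; the 2015 frame contains one exact duplicate row.
y2014 = pd.DataFrame({"sitename": ["Clinic A"], "zip": ["10451"], "year": [2014]})
y2015 = pd.DataFrame(
    {"sitename": ["Clinic A", "Clinic A"], "zip": ["10451", "10451"], "year": [2015, 2015]}
)

# Stack the yearly frames, then drop rows that match on every column.
combined = pd.concat([y2014, y2015], sort=True)
deduped = combined.drop_duplicates()

print(len(combined), len(deduped))
print(deduped)
```

To collapse a site across years instead, one option would be to pass a `subset` of columns that excludes `year`, at the cost of losing the per-year record.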


In [None]:
# Export to csv file
# index=False skips the row index, which repeats across years after pd.concat
data14to17_noDup.to_csv('./data/institutional_provider_2014to2017_q4.csv', index=False)
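
When the exported file is read back later, the `zip` column needs the same string treatment as on the first read, otherwise pandas infers a numeric type. The sketch below shows this on a hypothetical one-row frame written to an in-memory buffer rather than to `./data/`; the row values are made up.

```python
import io

import pandas as pd

# Hypothetical exported row, for illustration only.
toy = pd.DataFrame({"sitename": ["Clinic A"], "zip": ["00501"], "year": [2017]})

# Write to an in-memory buffer instead of disk so the sketch has no side effects.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Without a dtype, pandas infers an integer column and the leading zeros are lost.
buffer.seek(0)
inferred = pd.read_csv(buffer)

# Passing dtype keeps the ZIP codes as strings.
buffer.seek(0)
as_text = pd.read_csv(buffer, dtype={"zip": str})

print(inferred.loc[0, "zip"], as_text.loc[0, "zip"])
```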