This notebook prepares the data used both for *model preparation* (see the XX notebook) and for the *predictions* performed in my app.   
It produces the following outputs:

- one local dataset: "prepared_data/Output_local_basis.csv"
- a hospital dataset
- several datasets for models
- some dictionaries used in my app

# 1. Data gathering

## 1.1 Data from the CMS website

In [1]:
import pandas as pd
import os
import numpy as np
import dill

### 1.1.1 The inpatient provider utilization and payment data

The main dataset is **"Medicare Provider Utilization and Payment Data: Inpatient"** for 2017, the most recent year available:   
https://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/Medicare-Provider-Charge-Data/Inpatient

In [29]:
%%bash
wget https://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/Medicare-Provider-Charge-Data/Downloads/Inpatient_Data_2017_CSV.zip -nc -q 
unzip -u Inpatient_Data_2017_CSV.zip

Archive:  Inpatient_Data_2017_CSV.zip


In [None]:
%%bash
git add MEDICARE_PROVIDER_CHARGE_INPATIENT_DRGALL_FY2017.CSV
git commit

In [30]:
%%bash
rm Inpatient_Data_2017_CSV.zip

### 1.1.2 The latest *Hospital Compare* flat files package 


The Hospital Compare dataset (the most recent release is used):
https://data.medicare.gov/data/archives/hospital-compare

Two files are used:  
- the **general information file** (with hospital ownership, average ratings...) 
- the **Medicare Hospital Spending per Patient - Hospital.csv** file, which gives a price-standardized and risk-adjusted assessment of the spending per patient. https://www.medicare.gov/hospitalcompare/Data/Medicare-Spending-Beneficiary.html


In [36]:
%%bash
wget http://medicare.gov/download/HospitalCompare/2020/July/HOSArchive_Revised_Flatfiles_20200731.zip -nc -q 


In [37]:
# Create the folder only if it does not exist yet
os.makedirs('hospital_compare_data', exist_ok=True)

In [38]:
%%bash
cp HOSArchive_Revised_Flatfiles_20200731.zip hospital_compare_data 
rm HOSArchive_Revised_Flatfiles_20200731.zip

In [39]:
%%bash
cd hospital_compare_data
unzip HOSArchive_Revised_Flatfiles_20200731.zip 
ls

Archive:  HOSArchive_Revised_Flatfiles_20200731.zip
  inflating: ASC_CCN_pr19q1_19q4.csv  
  inflating: ASC_Facility.csv        
  inflating: ASC_National.csv        
  inflating: ASC_NATIONAL_pr19q1_19q4.csv  
  inflating: ASC_State.csv           
  inflating: ASC_STATE_pr19q1_19q4.csv  
  inflating: CJR PY4 Quality Reporting_July 2020_Production File.csv  
  inflating: CMS_PSI_6_decimal_file.csv  
  inflating: Complications and Deaths - Hospital.csv  
  inflating: Complications and Deaths - National.csv  
  inflating: Complications and Deaths - State.csv  
  inflating: Footnote Crosswalk.csv  
  inflating: footnotes_deliver_19q1_19q4.csv  
  inflating: FY2018_Distribution_of_Net_Change_in_Base_Op_DRG_Payment_Amt_2019-11-22.csv  
  inflating: FY2018_Net_Change_in_Base_Op_DRG_Payment_Amt_2019-11-22.csv  
  inflating: FY2018_Percent_Change_in_Medicare_Payments_2019-11-22.csv  
  inflating: FY2018_Value_Based_Incentive_Payment_Amount_2019-11-22.csv  
  inflating: HCAHPS - Hospital.csv   

In [40]:
%%bash
cd hospital_compare_data
mv 'Hospital General Information.csv' Hospital_General_Information.csv
mv 'Medicare Hospital Spending per Patient - Hospital.csv' Medicare_Hospital_spending_per_patient_Hospital.csv

In [41]:
%%bash
cp hospital_compare_data/Hospital_General_Information.csv .
cp hospital_compare_data/Medicare_Hospital_spending_per_patient_Hospital.csv .

ls


Data wrangling.ipynb
Hospital_General_Information.csv
MEDICARE_PROVIDER_CHARGE_INPATIENT_DRGALL_FY2017.CSV
Medicare_Hospital_spending_per_patient_Hospital.csv
hospital_compare_data


In [42]:
%%bash
rm -r hospital_compare_data
ls

Data wrangling.ipynb
Hospital_General_Information.csv
MEDICARE_PROVIDER_CHARGE_INPATIENT_DRGALL_FY2017.CSV
Medicare_Hospital_spending_per_patient_Hospital.csv


In [43]:
%%bash
git add Hospital_General_Information.csv Medicare_Hospital_spending_per_patient_Hospital.csv
git commit

[master 7f34862] Add two additionnal files from CMS: Hospital_General_Information.csv and Medicare_Hospital_spending_per_patient_Hospital.csv
 2 files changed, 10034 insertions(+)
 create mode 100644 raw_inputs/Hospital_General_Information.csv
 create mode 100644 raw_inputs/Medicare_Hospital_spending_per_patient_Hospital.csv


## 1.2 "Local" Data

Download:

- a zipcode - HRR (hospital referral region) crosswalk table from the Dartmouth Atlas
https://atlasdata.dartmouth.edu/downloads/geography/ZipHsaHrr17.xls    
- a crosswalk between zipcodes and counties https://www.huduser.gov/portal/datasets/usps_crosswalk.html (done manually)  
- IRS revenue data at county level: https://www.irs.gov/pub/irs-soi/17incyallnoagi.csv
- census data: cc-est2017-alldata.csv from (

In [46]:
%%bash
wget https://atlasdata.dartmouth.edu/downloads/geography/ZipHsaHrr17.xls  -q -nc

In [44]:
#move from download
os.rename("/Users/camilledethe/downloads/ZIP_COUNTY_122017.xlsx", "ZIP_COUNTY_122017.xlsx")


In [47]:
%%bash
wget https://www.irs.gov/pub/irs-soi/17incyallnoagi.csv  -q -nc

In [49]:
os.rename("/Users/camilledethe/downloads/us-zip-code-latitude-and-longitude.csv", "us-zip-code-latitude-and-longitude.csv")

In [82]:
#move from download
os.rename("/Users/camilledethe/downloads/cc-est2017-alldata.csv", "cc-est2017-alldata.csv")

In [118]:
%%bash 
git add 17incyallnoagi.csv "Data wrangling.ipynb" ZIP_COUNTY_122017.xlsx ZipHsaHrr17.xls cc-est2017-alldata.csv us-zip-code-latitude-and-longitude.csv

In [119]:
%%bash 
git commit

[master 0fae6a4] add all raw files for the "local" data set (including IRS, census, crosswalks files)
 6 files changed, 644392 insertions(+)
 create mode 100644 raw_inputs/17incyallnoagi.csv
 create mode 100644 raw_inputs/Data wrangling.ipynb
 create mode 100644 raw_inputs/ZIP_COUNTY_122017.xlsx
 create mode 100644 raw_inputs/ZipHsaHrr17.xls
 create mode 100644 raw_inputs/cc-est2017-alldata.csv
 create mode 100644 raw_inputs/us-zip-code-latitude-and-longitude.csv



# 2. Data preparation


## 2.1 Preparation of the "local" dataset

It's important not to destroy any information. So need 

### 2.1.1 Preparation of the basic structures: zipcode, longitude, latitude, county, HRR

Examples of missing zipcodes: 01040, 97471, 85209.  
`zip_coun` contains 39455 unique zipcodes.

In [215]:
len(missing_zipcode)

231

In [195]:
zip_coun=pd.read_excel('ZIP_COUNTY_122017.xlsx',usecols=[0,1],skiprows=0,dtype={'zip':str,'county':str})

In [196]:
# Problem: a few zipcodes map to several counties; keep only the first county listed
zip_coun=zip_coun.groupby('zip').county.agg(['first']).reset_index()

In [197]:
zip_coun.rename(columns={'first':'county'}, inplace=True)
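Before collapsing the crosswalk to one county per zipcode, it can be useful to see how many zipcodes are actually ambiguous. The sketch below uses a small made-up crosswalk (the zipcodes and county codes are hypothetical, not taken from the HUD file) and then applies the same "keep the first county" reduction as the cell above.

```python
import pandas as pd

# Hypothetical crosswalk rows: zipcode "00002" straddles two counties.
crosswalk = pd.DataFrame({
    "zip": ["00001", "00002", "00002"],
    "county": ["11111", "22222", "33333"],
})

# Number of distinct counties attached to each zipcode.
n_counties = crosswalk.groupby("zip")["county"].nunique()
ambiguous = n_counties[n_counties > 1].index.tolist()

# Same reduction as in the notebook: keep the first county listed per zipcode.
first_county = crosswalk.groupby("zip")["county"].first().reset_index()

print(ambiguous)
print(first_county)
```

Keeping the first county is a simplification: which county comes first depends on the row order of the source file.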

In [200]:
zip_coun.shape

(39455, 2)

In [276]:
zip_coun[zip_coun.zip.isin(['10065','13503'])]

Unnamed: 0,zip,county
3117,10065,36061


In [198]:
long_lat=pd.read_csv('us-zip-code-latitude-and-longitude.csv',sep=';', dtype={'Zip':str}, 
                     usecols=['Zip','City','State','Latitude','Longitude'])

In [211]:
long_lat.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 43191 entries, 0 to 43190
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Zip        43191 non-null  object 
 1   City       43191 non-null  object 
 2   State      43191 non-null  object 
 3   Latitude   43191 non-null  float64
 4   Longitude  43191 non-null  float64
dtypes: float64(2), object(3)
memory usage: 1.6+ MB


In [277]:
long_lat[long_lat.Zip.isin(['10065','13503'])]

Unnamed: 0,Zip,City,State,Latitude,Longitude
2215,13503,Utica,NY,43.101869,-75.231158


In [225]:
local_1=zip_coun.merge(long_lat, left_on='zip',right_on='Zip')
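The two lookups above show that 10065 appears only in `zip_coun` and 13503 only in `long_lat`, so the default inner merge drops both. A minimal sketch of how such losses can be listed with set differences, using toy frames (`zip_coun_demo` and `long_lat_demo` are illustrative names, with rows copied from the outputs above):

```python
import pandas as pd

# Toy stand-ins for zip_coun and long_lat.
zip_coun_demo = pd.DataFrame({"zip": ["10065", "01040"],
                              "county": ["36061", "25013"]})
long_lat_demo = pd.DataFrame({"Zip": ["13503", "01040"],
                              "City": ["Utica", "Holyoke"]})

merged = zip_coun_demo.merge(long_lat_demo, left_on="zip", right_on="Zip")

# Keys present on one side only are dropped by the default inner merge.
lost_from_left = set(zip_coun_demo["zip"]) - set(long_lat_demo["Zip"])
lost_from_right = set(long_lat_demo["Zip"]) - set(zip_coun_demo["zip"])

print(len(merged), sorted(lost_from_left), sorted(lost_from_right))
```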

In [226]:
local_1[local_1.zip.isin(missing_zipcode)]

Unnamed: 0,zip,county,Zip,City,State,Latitude,Longitude
203,01040,25013,01040,Holyoke,MA,42.201891,-72.62420
211,01060,25015,01060,Northampton,MA,42.324539,-72.63561
217,01069,25015,01069,Palmer,MA,42.176401,-72.32646
231,01085,25013,01085,Westfield,MA,42.133642,-72.75029
244,01104,25013,01104,Springfield,MA,42.130343,-72.57338
...,...,...,...,...,...,...,...
3028,08861,34023,08861,Perth Amboy,NJ,40.520105,-74.27708
3031,08865,34041,08865,Phillipsburg,NJ,40.689123,-75.17243
3039,08876,34035,08876,Somerville,NJ,40.545853,-74.63592
3050,08901,34023,08901,New Brunswick,NJ,40.488304,-74.44775


In [227]:
zip_hrr=pd.read_excel('ZipHsaHrr17.xls',dtype={'zipcode2017':str,'county':str})


In [238]:
zip_hrr['zipcode2017']=zip_hrr['zipcode2017'].apply(lambda x : x.zfill(5))
zip_hrr

Unnamed: 0,zipcode2017,hsanum,hsacity,hsastate,hrrnum,hrrcity,hrrstate,zip_pad
0,00501,33095,Patchogue,NY,301,East Long Island,NY,00501
1,00544,33095,Patchogue,NY,301,East Long Island,NY,00544
2,01001,22058,Springfield,MA,230,Springfield,MA,01001
3,01002,22046,Northampton,MA,230,Springfield,MA,01002
4,01003,22046,Northampton,MA,230,Springfield,MA,01003
...,...,...,...,...,...,...,...,...
40868,99926,2006,Ketchikan,AK,10,Anchorage,AK,99926
40869,99927,2006,Ketchikan,AK,10,Anchorage,AK,99927
40870,99928,2006,Ketchikan,AK,10,Anchorage,AK,99928
40871,99929,2015,Wrangell,AK,10,Anchorage,AK,99929
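The padding step matters because zipcodes that lose their leading zeros no longer match the 5-character strings of the other tables. A small sketch with the vectorized `Series.str.zfill`, which should be equivalent to the `apply` call above for string columns (the input values here are illustrative):

```python
import pandas as pd

# Zipcodes as they may look once leading zeros have been lost.
zips = pd.Series(["501", "1001", "99926"])

# Left-pad every value with zeros up to 5 characters.
padded = zips.str.zfill(5)
print(padded.tolist())
```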


In [239]:
local_2=zip_hrr.merge(local_1,left_on='zipcode2017',right_on='zip')

In [240]:
local_2[local_2.zipcode2017.isin(missing_zipcode)]

Unnamed: 0,zipcode2017,hsanum,hsacity,hsastate,hrrnum,hrrcity,hrrstate,zip_pad,zip,county,Zip,City,State,Latitude,Longitude
31,01040,22023,Holyoke,MA,230,Springfield,MA,01040,01040,25013,01040,Holyoke,MA,42.201891,-72.62420
39,01060,22046,Northampton,MA,230,Springfield,MA,01060,01060,25015,01060,Northampton,MA,42.324539,-72.63561
45,01069,22049,Palmer,MA,230,Springfield,MA,01069,01069,25015,01069,Palmer,MA,42.176401,-72.32646
59,01085,22066,Westfield,MA,230,Springfield,MA,01085,01085,25013,01085,Westfield,MA,42.133642,-72.75029
72,01104,22058,Springfield,MA,230,Springfield,MA,01104,01104,25013,01104,Springfield,MA,42.130343,-72.57338
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2829,08822,31014,Flemington,NJ,356,Philadelphia,PA,08822,08822,34019,08822,Flemington,NJ,40.515645,-74.85319
2856,08861,31042,Perth Amboy,NJ,288,New Brunswick,NJ,08861,08861,34023,08861,Perth Amboy,NJ,40.520105,-74.27708
2859,08865,31043,Phillipsburg,NJ,346,Allentown,PA,08865,08865,34041,08865,Phillipsburg,NJ,40.689123,-75.17243
2867,08876,31056,Somerville,NJ,288,New Brunswick,NJ,08876,08876,34035,08876,Somerville,NJ,40.545853,-74.63592


In [241]:
local_2=local_2[['zipcode2017','hrrnum', 'county','City','State','Latitude','Longitude']]

### 2.1.2 Addition of IRS data at county level

In [242]:
df2=pd.read_csv('17incyallnoagi.csv', usecols=['STATEFIPS','COUNTYFIPS','N00200','A00200','A00100','N1','N2'],dtype={'STATEFIPS':str,'COUNTYFIPS':str ,'CBSACODE':str})



In [243]:

df2['COUNTY_FULL_FIPS']=df2['STATEFIPS']+df2['COUNTYFIPS']

irs_2=df2.groupby('COUNTY_FULL_FIPS').sum().reset_index()

irs_2['average_AGI_c']=irs_2.A00100/irs_2.N1
irs_2['average_wage_c']=irs_2.A00200  /irs_2.N00200

irs_2=irs_2[['COUNTY_FULL_FIPS','average_AGI_c','average_wage_c']]
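The averages above are ratios of county totals (sum of amounts divided by sum of returns), not averages of per-row ratios. The sketch below illustrates the difference on made-up numbers, assuming a county could appear on several rows; the FIPS codes and values are hypothetical.

```python
import pandas as pd

# Hypothetical rows: two records for county "01001", one for "01003".
irs_demo = pd.DataFrame({
    "COUNTY_FULL_FIPS": ["01001", "01001", "01003"],
    "N1": [100, 300, 200],            # number of returns
    "A00100": [5000, 27000, 12000],   # adjusted gross income amount
})

# Sum first, then divide: each return carries the same weight.
totals = irs_demo.groupby("COUNTY_FULL_FIPS")[["N1", "A00100"]].sum()
totals["average_AGI_c"] = totals["A00100"] / totals["N1"]

print(totals)
```

For county "01001" the per-row ratios are 50 and 90 (mean 70), whereas the ratio of totals is 80, which is the figure that corresponds to the average per return.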

In [244]:
local_3=local_2.merge(irs_2, left_on='county',right_on='COUNTY_FULL_FIPS')

In [245]:
local_3.drop(columns=['COUNTY_FULL_FIPS'],inplace=True)

### 2.1.3 Addition of Census data at county level

In [246]:
dfc=pd.read_csv('cc-est2017-alldata.csv',encoding='latin-1', 
            usecols=['STATE','COUNTY','YEAR','AGEGRP','TOT_POP', 'BA_MALE', 'BA_FEMALE',  'H_MALE','H_FEMALE'],
            dtype= {'STATE':str,'COUNTY':str}   )


In [247]:
#Select only the lines for 2017
dfc=dfc[dfc['YEAR']==10]

In [248]:
#recreate the county
dfc['COUNTY_ID']=dfc['STATE']+dfc['COUNTY']


In [249]:
dfc['BA']=dfc['BA_MALE']+dfc['BA_FEMALE']
dfc['H']=dfc['H_MALE']+dfc['H_FEMALE']


In [250]:



dfc['over_65']=0
dfc.loc[dfc['AGEGRP']>13,'over_65']=dfc['TOT_POP']

# build the proportion of people above 65 ie agegrp 14,15,16,17,18
dfc=dfc.groupby('COUNTY_ID').agg(
                pop_tot=pd.NamedAgg(column='TOT_POP', aggfunc='max'),
                pop_ba=pd.NamedAgg(column='BA', aggfunc='max'),
                pop_h=pd.NamedAgg(column='H', aggfunc='max'),
                pop_over_65=pd.NamedAgg(column='over_65', aggfunc=np.sum ))
   

dfc=dfc.reset_index()

dfc['per_over_65']=dfc.pop_over_65/dfc.pop_tot
dfc['per_ba']=dfc.pop_ba/dfc.pop_tot
dfc['per_h']=dfc.pop_h/dfc.pop_tot

dfc=dfc[['COUNTY_ID','per_over_65','per_ba','per_h']]
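The share of people over 65 relies on the layout of the census file, where `AGEGRP` 0 is the all-ages total and groups 14 to 18 cover ages 65 and above. The sketch below is a variant of the cell above on made-up numbers: it selects the `AGEGRP == 0` row explicitly instead of taking the `max` of `TOT_POP`, which should give the same total as long as the all-ages row is present.

```python
import pandas as pd

# Hypothetical county with the all-ages row and three age groups.
census_demo = pd.DataFrame({
    "COUNTY_ID": ["36103"] * 4,
    "AGEGRP": [0, 13, 14, 15],
    "TOT_POP": [1000, 80, 60, 40],
})

# Total population: the AGEGRP == 0 row.
total = (census_demo.loc[census_demo["AGEGRP"] == 0]
         .set_index("COUNTY_ID")["TOT_POP"])

# Population aged 65 and over: sum of age groups 14 and above.
over_65 = (census_demo.loc[census_demo["AGEGRP"] >= 14]
           .groupby("COUNTY_ID")["TOT_POP"].sum())

per_over_65 = over_65 / total
print(per_over_65)
```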

In [251]:
local = local_3.merge(dfc,left_on='county',right_on='COUNTY_ID' )

In [253]:
local.head()

Unnamed: 0,zipcode2017,hrrnum,county,City,State,Latitude,Longitude,average_AGI_c,average_wage_c,COUNTY_ID,per_over_65,per_ba,per_h
0,501,301,36103,Holtsville,NY,40.922326,-72.637078,89.476674,73.885973,36103,0.163743,0.085894,0.195437
1,6390,111,36103,Fishers Island,NY,41.261936,-72.00708,89.476674,73.885973,36103,0.163743,0.085894,0.195437
2,11702,301,36103,Babylon,NY,40.687649,-73.32549,89.476674,73.885973,36103,0.163743,0.085894,0.195437
3,11703,301,36103,North Babylon,NY,40.733398,-73.32257,89.476674,73.885973,36103,0.163743,0.085894,0.195437
4,11704,301,36103,West Babylon,NY,40.719249,-73.35829,89.476674,73.885973,36103,0.163743,0.085894,0.195437


In [254]:
local[local.zipcode2017.isin(missing_zipcode)]

Unnamed: 0,zipcode2017,hrrnum,county,City,State,Latitude,Longitude,average_AGI_c,average_wage_c,COUNTY_ID,per_over_65,per_ba,per_h
121,01040,230,25013,Holyoke,MA,42.201891,-72.624200,59.715230,49.727145,25013,0.164700,0.107965,0.252553
130,01085,230,25013,Westfield,MA,42.133642,-72.750290,59.715230,49.727145,25013,0.164700,0.107965,0.252553
137,01104,230,25013,Springfield,MA,42.130343,-72.573380,59.715230,49.727145,25013,0.164700,0.107965,0.252553
155,01199,230,25013,Springfield,MA,42.119943,-72.604983,59.715230,49.727145,25013,0.164700,0.107965,0.252553
171,01060,230,25015,Northampton,MA,42.324539,-72.635610,71.770153,59.584764,25015,0.165608,0.033658,0.056453
...,...,...,...,...,...,...,...,...,...,...,...,...,...
2923,08210,283,34009,Cape May Court House,NJ,39.081754,-74.836580,64.498280,48.672040,34009,0.256336,0.049630,0.078298
2964,08534,356,34021,Pennington,NJ,40.323150,-74.783640,97.825045,80.590870,34021,0.147694,0.213870,0.177895
2985,08629,356,34021,Trenton,NJ,40.219358,-74.733340,97.825045,80.590870,34021,0.147694,0.213870,0.177895
2986,08638,356,34021,Trenton,NJ,40.249908,-74.759530,97.825045,80.590870,34021,0.147694,0.213870,0.177895


In [126]:

# Create the output folder only if it does not exist yet
os.makedirs('../prepared_data', exist_ok=True)

In [267]:
local.to_csv(path_or_buf='../prepared_data/Output_local_basis.csv',index=False)

In [129]:
%%bash
git add ../prepared_data/Output_local_basis.csv

In [273]:
%%bash
git add 'Data wrangling.ipynb'
git commit

[master e180a78] Modification of the data wrangling notebook: work on the data preparatio n of i) hospital_full,  ii) local_output_data and iii) inpatients and their respective combinations
 1 file changed, 2154 insertions(+), 105 deletions(-)


## 2.2 Preparation of the hospital dataset and the DRG dictionaries



**1. The inpatient dataset**

- 196325 observations covering 7 million discharges


- 3182 providers, i.e. hospitals (3181 after filtering out DRGs recorded in fewer than 8 different hospitals)  
301 have 5 observations or fewer,   
552 have 10 observations or fewer


- 563 DRGs (440 after filtering)   
106 have 5 observations or fewer  
149 have 10 observations or fewer  
256 have 50 observations or fewer


In [305]:
inpatient=pd.read_csv('MEDICARE_PROVIDER_CHARGE_INPATIENT_DRGALL_FY2017.CSV',
                      dtype={'Provider Id':str,'Provider Zip Code':str },
                     usecols=['DRG Definition', 'Provider Id','Provider State', 'Provider Zip Code',
       'Total Discharges', 'Average Covered Charges', 'Average Total Payments',
       'Average Medicare Payments'])

inpatient.columns=['DRG', 'prov_id','Provider State','prov_zip',
       'total_discharges', 'average_covered_charges', 'average_total_payments',
       'average_medicare_payments']

In [306]:
inpatient.prov_id.apply(lambda x : len(x))

0         5
1         5
2         5
3         5
4         5
         ..
196320    6
196321    6
196322    6
196323    6
196324    6
Name: prov_id, Length: 196325, dtype: int64

In [307]:
# Left-pad the 5-character provider ids with a zero so that they have 6 characters like the others
inpatient['prov_id']=inpatient['prov_id'].apply(lambda x : '0'+x if len(x)==5 else x)
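A possible alternative to the conditional `'0'+x` is `str.zfill`, sketched below with a hypothetical helper (`pad_provider_id` is not part of the notebook, and the ids are illustrative). Unlike the cell above, it also pads ids shorter than 5 characters, which may or may not be wanted.

```python
def pad_provider_id(prov_id: str, width: int = 6) -> str:
    """Left-pad a provider identifier with zeros up to `width` characters."""
    return prov_id.zfill(width)


# A 5-character id gains one leading zero; a 6-character id is unchanged.
print(pad_provider_id("10001"), pad_provider_id("670112"))
```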


In [308]:
inpatient.to_csv(path_or_buf='../prepared_data/Inpatient.csv',index=False)

In [281]:
%%bash
git add '../prepared_data/Inpatient.csv'
git commit -m 'add the Inpatient dataset in the prepared_data folder'

[master ca21a16] add the Inpatient dataset in the prepared_data folder
 1 file changed, 196326 insertions(+)
 create mode 100644 prepared_data/Inpatient.csv


**2. The hospital dataset**

The hospital general information file is merged with the average spending per Medicare patient dataset.   
Only Acute Care Hospitals in the 50 US states are selected.   
Only hospitals with all local information available are kept: 37 hospitals are lost because of missing local information (either longitude / latitude or HRR). 


In [309]:
hosp=pd.read_csv('Hospital_General_Information.csv' ,
                dtype={'Facility ID':str,'ZIP Code' : str}
                 ,
                 usecols=['Facility ID','Facility Name','ZIP Code', 'State',
                          'Hospital Type',  'Hospital Ownership','Emergency Services', 
                          'Hospital overall rating', 'Hospital overall rating footnote'] 
                )


In [310]:
hosp=hosp[~hosp.State.isin(['AS', 'GU','MP','PR','VI'])]

In [311]:
hosp.columns=['Facility ID','hosp_name','state','zipcode','hosp_type','hosp_ownership',
       'hosp_emergency_services',
       'hosp_rating', 'hosp_rating_fn']

In [312]:
med_spending=pd.read_csv('Medicare_Hospital_spending_per_patient_Hospital.csv'
                         , 
                         usecols=['Facility ID','Score','Footnote'], 
                         dtype={'Facility ID':str}
                        )
med_spending.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4712 entries, 0 to 4711
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Facility ID  4712 non-null   object 
 1   Score        4712 non-null   object 
 2   Footnote     1651 non-null   float64
dtypes: float64(1), object(2)
memory usage: 110.6+ KB


In [313]:
hosp_full=med_spending.merge(hosp,how='inner', on = 'Facility ID')

In [314]:
hosp_full.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4653 entries, 0 to 4652
Data columns (total 11 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Facility ID              4653 non-null   object 
 1   Score                    4653 non-null   object 
 2   Footnote                 1592 non-null   float64
 3   hosp_name                4653 non-null   object 
 4   state                    4653 non-null   object 
 5   zipcode                  4653 non-null   object 
 6   hosp_type                4653 non-null   object 
 7   hosp_ownership           4653 non-null   object 
 8   hosp_emergency_services  4653 non-null   object 
 9   hosp_rating              4653 non-null   object 
 10  hosp_rating_fn           1146 non-null   float64
dtypes: float64(2), object(9)
memory usage: 436.2+ KB


In [315]:
hosp_full=hosp_full[hosp_full.hosp_type=='Acute Care Hospitals']

In [316]:
full_local_hosp_temp=local.merge(hosp_full,how='outer', left_on='zipcode2017',right_on='zipcode')

In [317]:
# Zipcodes of hospitals that have no match in the local dataset
missing_zipcode=full_local_hosp_temp[full_local_hosp_temp['Facility ID'].notna() & full_local_hosp_temp.zipcode2017.isna()].sort_values(by=['zipcode']).zipcode

In [318]:
hosp_full=hosp_full[~hosp_full.zipcode.isin(missing_zipcode)]
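The outer merge followed by `notna()` / `isna()` tests identifies hospitals whose zipcode is absent from the local dataset. The same check can be sketched with the `indicator=True` option of `DataFrame.merge`, shown here on toy frames (the facility ids and the zipcode "99999" are made up):

```python
import pandas as pd

local_demo = pd.DataFrame({"zipcode2017": ["01040", "11706"]})
hosp_demo = pd.DataFrame({"Facility ID": ["A", "B", "C"],
                          "zipcode": ["01040", "11706", "99999"]})

# indicator=True adds a `_merge` column: 'both', 'left_only' or 'right_only'.
check = local_demo.merge(hosp_demo, how="outer", left_on="zipcode2017",
                         right_on="zipcode", indicator=True)

# Hospitals with no local information are the 'right_only' rows.
missing = check.loc[check["_merge"] == "right_only", "zipcode"].tolist()
print(missing)
```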

In [321]:
hosp_full.to_csv(path_or_buf='../prepared_data/hospitals.csv', index=False)

In [298]:
%%bash
git add '../prepared_data/hospitals.csv'
git commit -m 'The hospitals dataset is modif'

[master 1894f37] The hospitals dataset is added
 1 file changed, 3169 insertions(+)
 create mode 100644 prepared_data/hospitals.csv


# 3. Description of the merged data

We merge all available information.

In [322]:
local_hosp_temp=local.merge(hosp_full,how='inner', left_on='zipcode2017',right_on='zipcode')

In [323]:
local_hosp_temp

Unnamed: 0,zipcode2017,hrrnum,county,City,State,Latitude,Longitude,average_AGI_c,average_wage_c,COUNTY_ID,...,Score,Footnote,hosp_name,state,zipcode,hosp_type,hosp_ownership,hosp_emergency_services,hosp_rating,hosp_rating_fn
0,11706,301,36103,Bay Shore,NY,40.729098,-73.25607,89.476674,73.885973,36103,...,1.06,,NS/LIJ HS SOUTHSIDE HOSPITAL,NY,11706,Acute Care Hospitals,Voluntary non-profit - Private,Yes,3,
1,11743,301,36103,Huntington,NY,40.867498,-73.41146,89.476674,73.885973,36103,...,1.01,,NS/LIJ HS HUNTINGTON HOSPITAL,NY,11743,Acute Care Hospitals,Voluntary non-profit - Private,Yes,4,
2,11772,301,36103,Patchogue,NY,40.770898,-73.00213,89.476674,73.885973,36103,...,1.10,,LONG ISLAND COMMUNITY HOSPITAL,NY,11772,Acute Care Hospitals,Voluntary non-profit - Private,Yes,1,
3,11777,301,36103,Port Jefferson,NY,40.946103,-73.06222,89.476674,73.885973,36103,...,1.04,,JOHN T MATHER MEMORIAL HOSPITAL OF PORT JEFFE...,NY,11777,Acute Care Hospitals,Voluntary non-profit - Private,Yes,2,
4,11777,301,36103,Port Jefferson,NY,40.946103,-73.06222,89.476674,73.885973,36103,...,1.03,,ST CHARLES HOSPITAL,NY,11777,Acute Care Hospitals,Voluntary non-profit - Private,Yes,3,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3163,99559,10,02050,Bethel,AK,60.766603,-161.88006,36.871655,35.371613,02050,...,0.80,,YUKON KUSKOKWIM DELTA REG HOSPITAL,AK,99559,Acute Care Hospitals,Tribal,Yes,3,17.0
3164,99669,10,02122,Soldotna,AK,60.489536,-151.02091,64.800247,54.978382,02122,...,0.90,,CENTRAL PENINSULA GENERAL HOSPITAL,AK,99669,Acute Care Hospitals,Voluntary non-profit - Other,Yes,3,
3165,99645,10,02170,Palmer,AK,61.598203,-149.04109,65.127874,59.067824,02170,...,0.79,,MAT-SU REGIONAL MEDICAL CENTER,AK,99645,Acute Care Hospitals,Voluntary non-profit - Other,Yes,3,
3166,99701,10,02090,Fairbanks,AK,64.835070,-147.72045,66.448590,55.822267,02090,...,0.80,,FAIRBANKS MEMORIAL HOSPITAL,AK,99701,Acute Care Hospitals,Voluntary non-profit - Private,Yes,4,


In [333]:
local_hosp_temp2=local_hosp_temp.merge(inpatient,how='left', left_on='Facility ID',right_on='prov_id')

In [334]:
local_hosp_temp.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3168 entries, 0 to 3167
Data columns (total 24 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   zipcode2017              3168 non-null   object 
 1   hrrnum                   3168 non-null   int64  
 2   county                   3168 non-null   object 
 3   City                     3168 non-null   object 
 4   State                    3168 non-null   object 
 5   Latitude                 3168 non-null   float64
 6   Longitude                3168 non-null   float64
 7   average_AGI_c            3168 non-null   float64
 8   average_wage_c           3168 non-null   float64
 9   COUNTY_ID                3168 non-null   object 
 10  per_over_65              3168 non-null   float64
 11  per_ba                   3168 non-null   float64
 12  per_h                    3168 non-null   float64
 13  Facility ID              3168 non-null   object 
 14  Score                   

In [335]:
data_shema=local_hosp_temp2[['Facility ID','Footnote','prov_id']].drop_duplicates()
data_shema.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3168 entries, 0 to 192058
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Facility ID  3168 non-null   object 
 1   Footnote     144 non-null    float64
 2   prov_id      3044 non-null   object 
dtypes: float64(1), object(2)
memory usage: 99.0+ KB


In [338]:
print(data_shema[data_shema['prov_id'].notna() & data_shema.Footnote.isna()].shape[0],
data_shema[data_shema['prov_id'].isna() & data_shema.Footnote.isna()].shape[0],
data_shema[data_shema['prov_id'].notna() & data_shema.Footnote.notna()].shape[0],
data_shema[data_shema['prov_id'].isna() & data_shema.Footnote.notna()].shape[0])

2985 39 59 85
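The four numbers printed above count hospitals by whether they have inpatient records (`prov_id` present) and whether the spending score carries a footnote; they add up to the 3168 hospitals. A sketch of the same cross-count as a 2x2 table with `pd.crosstab`, on a made-up frame (`schema_demo` and its values are illustrative):

```python
import pandas as pd

schema_demo = pd.DataFrame({
    "Footnote": [None, 5.0, None, 5.0],
    "prov_id": ["010001", "010005", None, None],
})

# Turn the two presence tests into readable labels.
labels = {True: "yes", False: "no"}
has_inpatient = schema_demo["prov_id"].notna().map(labels).rename("has_inpatient")
has_footnote = schema_demo["Footnote"].notna().map(labels).rename("has_footnote")

# Rows: inpatient data present or not; columns: footnote present or not.
table = pd.crosstab(has_inpatient, has_footnote)
print(table)
```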


# 4. Filtering data & preparation of dictionaries and lists for the app

Objects prepared for the app:

- CMS_dic
- 'healthcare_simulator/app/CMS2DRG_dic.pkd': crosswalk from a CMS category to a dictionary mapping DRG_text to DRG_num
- dic_num_to_DRG
- list_DRG


CMS_dic is not filtered; this may need to be revisited.

In [7]:
%%bash
ls ../prepared_data

Book3.xlsx
Inpatient.csv
Output_local_basis.csv
est_1.pkd
est_2.pkd
est_3.pkd
est_4.pkd
hospitals.csv
~$Book3.xlsx


In [8]:
inpatient=pd.read_csv('../prepared_data/Inpatient.csv')

In [9]:
#Count the number of hospitals where we have a record of this specific DRG
diag=inpatient.groupby('DRG')['total_discharges'].agg(['count', 'sum']).reset_index()

In [10]:
# Set the threshold to 10: a DRG needs prices from at least 10 different hospitals to be included in the app
c=10

In [11]:
#filter the DRG
diag=diag[diag['count']>=c]
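The filter works because, assuming each row of the inpatient file is one hospital and DRG pair, `count` is the number of hospitals reporting the DRG and `sum` is its total number of discharges. A toy illustration (the DRG labels, values and the threshold of 2 are made up for this small table):

```python
import pandas as pd

inpatient_demo = pd.DataFrame({
    "DRG": ["001 - A", "001 - A", "002 - B"],
    "total_discharges": [12, 30, 15],
})

# One row per DRG: number of hospital records and total discharges.
stats = (inpatient_demo.groupby("DRG")["total_discharges"]
         .agg(["count", "sum"]).reset_index())

min_hospitals = 2  # hypothetical threshold; the notebook uses 10
kept = stats[stats["count"] >= min_hospitals]
print(kept)
```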

In [12]:
#Extract the number in the DRG
diag['DRG_num']=diag['DRG'].str[:3].astype(int)
diag['DRG_text']=diag['DRG'].str[6:]
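The fixed slices `[:3]` and `[6:]` assume that every label has the form `NNN - TEXT`. A sketch of a variant that splits on the separator instead (`split_drg` is a hypothetical helper, not part of the notebook):

```python
def split_drg(label: str) -> tuple:
    """Split a label of the form 'NNN - TEXT' into (NNN as int, TEXT)."""
    code, _, text = label.partition(" - ")
    return int(code), text.strip()


print(split_drg("006 - LIVER TRANSPLANT W/O MCC"))
```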


In [13]:
CMS_dic={0:'Transplants and tracheostomy',
            1:'Nervous System',
            2:'Eye',
            3:'Ear, Nose, Mouth And Throat',
            4:'Respiratory System',
            5:'Circulatory System',
            6:'Digestive System',
            7:'Hepatobiliary System And Pancreas',
            8:'Musculoskeletal System And Connective Tissue',
            9:'Skin, Subcutaneous Tissue And Breast',
            10:'Endocrine, Nutritional And Metabolic System',
            11:'Kidney And Urinary Tract',
            12:'Male Reproductive System',
            13:'Female Reproductive System',
            14:'Pregnancy, Childbirth And Puerperium',
            15:'Newborn And Other Neonates (Perinatal Period)',
            16:'Blood and Blood Forming Organs and Immunological Disorders',
            17:'Myeloproliferative DDs (Poorly Differentiated Neoplasms)',
            18:'Infectious and Parasitic DDs',
            19:'Mental Diseases and Disorders',
            20:'Alcohol/Drug Use or Induced Mental Disorders',
            21:'Injuries, Poison And Toxic Effect of Drugs',
            22:'Burns',
            23:'Factors Influencing Health Status',
            24:'Multiple Significant Trauma',
            25: 'Human Immunodeficiency Virus Infection',
            26:'Non MDC'}

In [14]:
MDC_DRG_crosswalk=[(0,1,13),(1,20,103),(2,113,125),(3,129,159),(4,163,208),
                   (5,215,316),(6,326,395),(7,405,446),(8,453,566),(9,573,607),(10,614,645),
                  (11,652,700),(12,707,730),(13,734,761),(14,765,782),(15,789,795),(16,799,816),
                  (17,820,849),(18,853,872),(19,876,887),(20,894,897),(21,901,923),(22,927,935),
                  (23,939,951),(24,955,965),(25,969,977),(26,981,999)]


In [15]:
# Build the MDC (major diagnostic category) column from the DRG number ranges
for t in MDC_DRG_crosswalk:
    diag[str(t[0])]=diag['DRG_num'].apply(lambda x: t[0] if x >= t[1] and x<=t[2] else 0)

diag['MDC']=0
for t in MDC_DRG_crosswalk:
    diag['MDC']+=diag[str(t[0])]
    diag.drop(columns=[str(t[0])], inplace=True)

diag.head()

Unnamed: 0,DRG,count,sum,DRG_num,DRG_text,MDC
0,001 - HEART TRANSPLANT OR IMPLANT OF HEART ASS...,81,1955,1,HEART TRANSPLANT OR IMPLANT OF HEART ASSIST SY...,0
1,003 - ECMO OR TRACH W MV >96 HRS OR PDX EXC FA...,421,11695,3,"ECMO OR TRACH W MV >96 HRS OR PDX EXC FACE, MO...",0
2,"004 - TRACH W MV >96 HRS OR PDX EXC FACE, MOUT...",431,8328,4,"TRACH W MV >96 HRS OR PDX EXC FACE, MOUTH & NE...",0
3,005 - LIVER TRANSPLANT W MCC OR INTESTINAL TRA...,53,1080,5,LIVER TRANSPLANT W MCC OR INTESTINAL TRANSPLANT,0
4,006 - LIVER TRANSPLANT W/O MCC,10,143,6,LIVER TRANSPLANT W/O MCC,0
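In the loop above, a DRG number that falls outside every range gets 0, which is also the code of the first category. A sketch of a range lookup that keeps the two cases apart by returning `None` when nothing matches (`mdc_for` is a hypothetical helper; the ranges are a subset copied from `MDC_DRG_crosswalk`):

```python
# A few (MDC, first DRG, last DRG) ranges copied from MDC_DRG_crosswalk.
ranges = [(0, 1, 13), (1, 20, 103), (4, 163, 208), (5, 215, 316)]


def mdc_for(drg_num, crosswalk=ranges):
    """Return the MDC whose range contains drg_num, or None if no range matches."""
    for mdc, first, last in crosswalk:
        if first <= drg_num <= last:
            return mdc
    return None


print(mdc_for(6), mdc_for(190), mdc_for(14))
```

With the ranges as written, DRG 14 is not covered by `(0, 1, 13)`, so it is reported as unmatched here rather than silently assigned to category 0.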


In [16]:
pd.options.display.max_colwidth = 120
diag.head(60)


Unnamed: 0,DRG,count,sum,DRG_num,DRG_text,MDC
0,001 - HEART TRANSPLANT OR IMPLANT OF HEART ASSIST SYSTEM W MCC,81,1955,1,HEART TRANSPLANT OR IMPLANT OF HEART ASSIST SYSTEM W MCC,0
1,"003 - ECMO OR TRACH W MV >96 HRS OR PDX EXC FACE, MOUTH & NECK W MAJ O.R.",421,11695,3,"ECMO OR TRACH W MV >96 HRS OR PDX EXC FACE, MOUTH & NECK W MAJ O.R.",0
2,"004 - TRACH W MV >96 HRS OR PDX EXC FACE, MOUTH & NECK W/O MAJ O.R.",431,8328,4,"TRACH W MV >96 HRS OR PDX EXC FACE, MOUTH & NECK W/O MAJ O.R.",0
3,005 - LIVER TRANSPLANT W MCC OR INTESTINAL TRANSPLANT,53,1080,5,LIVER TRANSPLANT W MCC OR INTESTINAL TRANSPLANT,0
4,006 - LIVER TRANSPLANT W/O MCC,10,143,6,LIVER TRANSPLANT W/O MCC,0
5,007 - LUNG TRANSPLANT,22,414,7,LUNG TRANSPLANT,0
7,"011 - TRACHEOSTOMY FOR FACE,MOUTH & NECK DIAGNOSES W MCC",39,603,11,"TRACHEOSTOMY FOR FACE,MOUTH & NECK DIAGNOSES W MCC",0
8,"012 - TRACHEOSTOMY FOR FACE,MOUTH & NECK DIAGNOSES W CC",53,986,12,"TRACHEOSTOMY FOR FACE,MOUTH & NECK DIAGNOSES W CC",0
10,014 - ALLOGENEIC BONE MARROW TRANSPLANT,31,560,14,ALLOGENEIC BONE MARROW TRANSPLANT,0
11,016 - AUTOLOGOUS BONE MARROW TRANSPLANT W CC/MCC,70,1745,16,AUTOLOGOUS BONE MARROW TRANSPLANT W CC/MCC,0


In [18]:
# Count how many MDC categories have at least one DRG
len(diag.MDC.value_counts())

26

In [None]:
cms2drg=dict()
for m in diag['MDC'].value_counts().index:
    #select only data from this MDC then
   
    cms2drg[m]=dict(zip(diag[diag['MDC']==m].DRG_text,diag[diag['MDC']==m].DRG_num))
cms2drg   

# Crosswalk between CMS categories and (filtered) DRGs: a dictionary of dictionaries
dill.dump(cms2drg,open('healthcare_simulator/app/CMS2DRG_dic.pkd', 'wb'))   

diag['DRG'].value_counts().index

list_DRG=list(diag['DRG'].value_counts().index)
dill.dump(list_DRG,open('healthcare_simulator/app/list_DRG.pkd', 'wb'))   

dic_num_to_DRG=dict(zip([int(l[:3]) for l in list_DRG],list_DRG))
dill.dump(dic_num_to_DRG,open('healthcare_simulator/app/dic_num_to_DRG.pkd', 'wb'))   

sorted(dic_num_to_DRG.keys())
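The dumps above leave the file handles open until they are garbage collected. A minimal sketch of the same save-and-reload pattern with a context manager follows; it uses the standard-library `pickle` module rather than `dill` so that it runs without extra dependencies, and it writes to a temporary folder with an illustrative file name.

```python
import os
import pickle
import tempfile

dic_demo = {1: "001 - HEART TRANSPLANT OR IMPLANT OF HEART ASSIST SYSTEM W MCC"}

# Illustrative location; the notebook writes to healthcare_simulator/app/.
path = os.path.join(tempfile.mkdtemp(), "dic_num_to_DRG_demo.pkd")

# The context manager closes the file even if dumping raises.
with open(path, "wb") as fh:
    pickle.dump(dic_demo, fh)

# Reload to check that the object survives the round trip.
with open(path, "rb") as fh:
    restored = pickle.load(fh)

print(restored == dic_demo)
```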



**Save all the lists and dictionaries**

In [None]:
dill.dump(CMS_dic, open('healthcare_simulator/app/CMS_dic.pkd', 'wb'))     

In [339]:
%%bash 
git add ../prepared_data


In [340]:
%%bash
git commit -m 'modified data for local, hospitals and inpatients'

[master cf87ecc] modified data for local, hospitals and inpatients
 1 file changed, 31281 insertions(+), 31281 deletions(-)
