This notebook is dedicated to data preparation, both for *model preparation* (see the XX notebook) and for the *predictions* performed in my app.   
It outputs several things:

- one local dataset: "prepared_data/Output_local_basis.csv"
- a hospital data set
- several datasets for models
- some dictionaries used in my app

# 1. Data gathering

## 1.1 Data from the CMS website

In [1]:
import pandas as pd
import os
import numpy as np

### 1.1.1 The inpatient provider utilization and payment data

The main dataset: **"Medicare Provider Utilization and Payment Data: Inpatient"** for year 2017, which is the most recent year available:   
https://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/Medicare-Provider-Charge-Data/Inpatient

In [29]:
%%bash
wget https://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/Medicare-Provider-Charge-Data/Downloads/Inpatient_Data_2017_CSV.zip -nc -q 
unzip -u Inpatient_Data_2017_CSV.zip

Archive:  Inpatient_Data_2017_CSV.zip


In [None]:
%%bash
git add MEDICARE_PROVIDER_CHARGE_INPATIENT_DRGALL_FY2017.CSV
git commit

In [30]:
%%bash
rm Inpatient_Data_2017_CSV.zip

### 1.1.2 The latest *hospital compare* flat files package 


The hospital compare dataset (the most recent release was chosen):
https://data.medicare.gov/data/archives/hospital-compare

Pick up two important files:  
- the **general information file** (with hospital ownership, average ratings...) 
- the **Medicare Hospital Spending per Patient - Hospital.csv** file which gives an assessment of the spending per patient in a price-standardized and risk-adjusted way. https://www.medicare.gov/hospitalcompare/Data/Medicare-Spending-Beneficiary.html


In [36]:
%%bash
wget http://medicare.gov/download/HospitalCompare/2020/July/HOSArchive_Revised_Flatfiles_20200731.zip -nc -q 


In [37]:
# create the folder if it does not exist yet
os.makedirs('hospital_compare_data', exist_ok=True)

In [38]:
%%bash
cp HOSArchive_Revised_Flatfiles_20200731.zip hospital_compare_data 
rm HOSArchive_Revised_Flatfiles_20200731.zip

In [39]:
%%bash
cd hospital_compare_data
unzip HOSArchive_Revised_Flatfiles_20200731.zip 
ls

Archive:  HOSArchive_Revised_Flatfiles_20200731.zip
  inflating: ASC_CCN_pr19q1_19q4.csv  
  inflating: ASC_Facility.csv        
  inflating: ASC_National.csv        
  inflating: ASC_NATIONAL_pr19q1_19q4.csv  
  inflating: ASC_State.csv           
  inflating: ASC_STATE_pr19q1_19q4.csv  
  inflating: CJR PY4 Quality Reporting_July 2020_Production File.csv  
  inflating: CMS_PSI_6_decimal_file.csv  
  inflating: Complications and Deaths - Hospital.csv  
  inflating: Complications and Deaths - National.csv  
  inflating: Complications and Deaths - State.csv  
  inflating: Footnote Crosswalk.csv  
  inflating: footnotes_deliver_19q1_19q4.csv  
  inflating: FY2018_Distribution_of_Net_Change_in_Base_Op_DRG_Payment_Amt_2019-11-22.csv  
  inflating: FY2018_Net_Change_in_Base_Op_DRG_Payment_Amt_2019-11-22.csv  
  inflating: FY2018_Percent_Change_in_Medicare_Payments_2019-11-22.csv  
  inflating: FY2018_Value_Based_Incentive_Payment_Amount_2019-11-22.csv  
  inflating: HCAHPS - Hospital.csv   

In [40]:
%%bash
cd hospital_compare_data
mv 'Hospital General Information.csv' Hospital_General_Information.csv
mv 'Medicare Hospital Spending per Patient - Hospital.csv' Medicare_Hospital_spending_per_patient_Hospital.csv

In [41]:
%%bash
cp hospital_compare_data/Hospital_General_Information.csv .
cp hospital_compare_data/Medicare_Hospital_spending_per_patient_Hospital.csv .

ls


Data wrangling.ipynb
Hospital_General_Information.csv
MEDICARE_PROVIDER_CHARGE_INPATIENT_DRGALL_FY2017.CSV
Medicare_Hospital_spending_per_patient_Hospital.csv
hospital_compare_data


In [42]:
%%bash
rm -r hospital_compare_data
ls

Data wrangling.ipynb
Hospital_General_Information.csv
MEDICARE_PROVIDER_CHARGE_INPATIENT_DRGALL_FY2017.CSV
Medicare_Hospital_spending_per_patient_Hospital.csv


In [43]:
%%bash
git add Hospital_General_Information.csv Medicare_Hospital_spending_per_patient_Hospital.csv
git commit

[master 7f34862] Add two additionnal files from CMS: Hospital_General_Information.csv and Medicare_Hospital_spending_per_patient_Hospital.csv
 2 files changed, 10034 insertions(+)
 create mode 100644 raw_inputs/Hospital_General_Information.csv
 create mode 100644 raw_inputs/Medicare_Hospital_spending_per_patient_Hospital.csv


## 1.2 "Local" Data

Download:

- a zipcode - HRR (hospital referral region) crosswalk table from the Dartmouth Atlas
https://atlasdata.dartmouth.edu/downloads/geography/ZipHsaHrr17.xls    
- a crosswalk between zipcodes and counties https://www.huduser.gov/portal/datasets/usps_crosswalk.html (done manually)  
- IRS revenue data at county level: https://www.irs.gov/pub/irs-soi/17incyallnoagi.csv
- census data: cc-est2017-alldata.csv (downloaded manually)

In [46]:
%%bash
wget https://atlasdata.dartmouth.edu/downloads/geography/ZipHsaHrr17.xls  -q -nc

In [44]:
#move from download
os.rename("/Users/camilledethe/downloads/ZIP_COUNTY_122017.xlsx", "ZIP_COUNTY_122017.xlsx")


In [47]:
%%bash
wget https://www.irs.gov/pub/irs-soi/17incyallnoagi.csv  -q -nc

In [49]:
os.rename("/Users/camilledethe/downloads/us-zip-code-latitude-and-longitude.csv", "us-zip-code-latitude-and-longitude.csv")

In [82]:
#move from download
os.rename("/Users/camilledethe/downloads/cc-est2017-alldata.csv", "cc-est2017-alldata.csv")

In [118]:
%%bash 
git add 17incyallnoagi.csv "Data wrangling.ipynb" ZIP_COUNTY_122017.xlsx ZipHsaHrr17.xls cc-est2017-alldata.csv us-zip-code-latitude-and-longitude.csv

In [119]:
%%bash 
git commit

[master 0fae6a4] add all raw files for the "local" data set (including IRS, census, crosswalks files)
 6 files changed, 644392 insertions(+)
 create mode 100644 raw_inputs/17incyallnoagi.csv
 create mode 100644 raw_inputs/Data wrangling.ipynb
 create mode 100644 raw_inputs/ZIP_COUNTY_122017.xlsx
 create mode 100644 raw_inputs/ZipHsaHrr17.xls
 create mode 100644 raw_inputs/cc-est2017-alldata.csv
 create mode 100644 raw_inputs/us-zip-code-latitude-and-longitude.csv



# 2. Data preparation


## 2.1 Preparation of the "local" dataset

It's important not to destroy any information when building this dataset.

### 2.1.1 Preparation of the basic structures: zipcode, longitude, latitude, county, HRR

Examples of zip codes found in `missing_zipcode` (built in section 2.2): 01040, 97471, 85209.  
`zip_coun` contains 39455 unique zip codes.

In [215]:
len(missing_zipcode)

231

In [195]:
zip_coun=pd.read_excel('ZIP_COUNTY_122017.xlsx',usecols=[0,1],skiprows=0,dtype={'zip':str,'county':str})

In [196]:
#Problem: a few zip codes are mapped to several counties; keep only the first county listed
zip_coun=zip_coun.groupby('zip').county.agg(['first']).reset_index()

In [197]:
zip_coun.rename(columns={'first':'county'}, inplace=True)
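
To make the "keep the first county" step concrete, here is a minimal sketch on a made-up crosswalk (the zip and county codes below are invented, not taken from the HUD file). It also shows `drop_duplicates`, which should give the same result when the row order already reflects the preferred county.

```python
import pandas as pd

# Toy crosswalk: zip code '00001' is (hypothetically) split across two counties.
toy = pd.DataFrame({
    'zip':    ['00001', '00001', '00002'],
    'county': ['11111', '22222', '33333'],
})

# Same approach as in the notebook: keep the first county listed for each zip.
first_county = toy.groupby('zip').county.agg(['first']).reset_index()
first_county = first_county.rename(columns={'first': 'county'})

# Alternative: drop the later duplicates of each zip code.
dedup = toy.drop_duplicates(subset='zip', keep='first').reset_index(drop=True)

print(first_county)
```

Note that "first" is an arbitrary choice: it depends only on the row order of the input file.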

In [200]:
zip_coun.shape

(39455, 2)

In [224]:
zip_coun[zip_coun.zip.isin(missing_zipcode)]

Unnamed: 0,zip,county
204,01040,25013
212,01060,25015
218,01069,25015
232,01085,25013
245,01104,25013
...,...,...
34387,85209,04013
34639,85755,04019
34740,86409,04015
36251,92395,06071


In [198]:
long_lat=pd.read_csv('us-zip-code-latitude-and-longitude.csv',sep=';', dtype={'Zip':str}, 
                     usecols=['Zip','City','State','Latitude','Longitude'])

In [211]:
long_lat.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 43191 entries, 0 to 43190
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Zip        43191 non-null  object 
 1   City       43191 non-null  object 
 2   State      43191 non-null  object 
 3   Latitude   43191 non-null  float64
 4   Longitude  43191 non-null  float64
dtypes: float64(2), object(3)
memory usage: 1.6+ MB


In [225]:
local_1=zip_coun.merge(long_lat, left_on='zip',right_on='Zip')

In [226]:
local_1[local_1.zip.isin(missing_zipcode)]

Unnamed: 0,zip,county,Zip,City,State,Latitude,Longitude
203,01040,25013,01040,Holyoke,MA,42.201891,-72.62420
211,01060,25015,01060,Northampton,MA,42.324539,-72.63561
217,01069,25015,01069,Palmer,MA,42.176401,-72.32646
231,01085,25013,01085,Westfield,MA,42.133642,-72.75029
244,01104,25013,01104,Springfield,MA,42.130343,-72.57338
...,...,...,...,...,...,...,...
3028,08861,34023,08861,Perth Amboy,NJ,40.520105,-74.27708
3031,08865,34041,08865,Phillipsburg,NJ,40.689123,-75.17243
3039,08876,34035,08876,Somerville,NJ,40.545853,-74.63592
3050,08901,34023,08901,New Brunswick,NJ,40.488304,-74.44775


In [227]:
zip_hrr=pd.read_excel('ZipHsaHrr17.xls',dtype={'zipcode2017':str,'county':str})


In [238]:
zip_hrr['zipcode2017']=zip_hrr['zipcode2017'].apply(lambda x : x.zfill(5))
zip_hrr

Unnamed: 0,zipcode2017,hsanum,hsacity,hsastate,hrrnum,hrrcity,hrrstate,zip_pad
0,00501,33095,Patchogue,NY,301,East Long Island,NY,00501
1,00544,33095,Patchogue,NY,301,East Long Island,NY,00544
2,01001,22058,Springfield,MA,230,Springfield,MA,01001
3,01002,22046,Northampton,MA,230,Springfield,MA,01002
4,01003,22046,Northampton,MA,230,Springfield,MA,01003
...,...,...,...,...,...,...,...,...
40868,99926,2006,Ketchikan,AK,10,Anchorage,AK,99926
40869,99927,2006,Ketchikan,AK,10,Anchorage,AK,99927
40870,99928,2006,Ketchikan,AK,10,Anchorage,AK,99928
40871,99929,2015,Wrangell,AK,10,Anchorage,AK,99929


In [239]:
local_2=zip_hrr.merge(local_1,left_on='zipcode2017',right_on='zip')

In [240]:
local_2[local_2.zipcode2017.isin(missing_zipcode)]

Unnamed: 0,zipcode2017,hsanum,hsacity,hsastate,hrrnum,hrrcity,hrrstate,zip_pad,zip,county,Zip,City,State,Latitude,Longitude
31,01040,22023,Holyoke,MA,230,Springfield,MA,01040,01040,25013,01040,Holyoke,MA,42.201891,-72.62420
39,01060,22046,Northampton,MA,230,Springfield,MA,01060,01060,25015,01060,Northampton,MA,42.324539,-72.63561
45,01069,22049,Palmer,MA,230,Springfield,MA,01069,01069,25015,01069,Palmer,MA,42.176401,-72.32646
59,01085,22066,Westfield,MA,230,Springfield,MA,01085,01085,25013,01085,Westfield,MA,42.133642,-72.75029
72,01104,22058,Springfield,MA,230,Springfield,MA,01104,01104,25013,01104,Springfield,MA,42.130343,-72.57338
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2829,08822,31014,Flemington,NJ,356,Philadelphia,PA,08822,08822,34019,08822,Flemington,NJ,40.515645,-74.85319
2856,08861,31042,Perth Amboy,NJ,288,New Brunswick,NJ,08861,08861,34023,08861,Perth Amboy,NJ,40.520105,-74.27708
2859,08865,31043,Phillipsburg,NJ,346,Allentown,PA,08865,08865,34041,08865,Phillipsburg,NJ,40.689123,-75.17243
2867,08876,31056,Somerville,NJ,288,New Brunswick,NJ,08876,08876,34035,08876,Somerville,NJ,40.545853,-74.63592


In [241]:
local_2=local_2[['zipcode2017','hrrnum', 'county','City','State','Latitude','Longitude']]

### 2.1.2 Addition of IRS data at county level

In [242]:
df2=pd.read_csv('17incyallnoagi.csv', usecols=['STATEFIPS','COUNTYFIPS','N00200','A00200','A00100','N1','N2'],dtype={'STATEFIPS':str,'COUNTYFIPS':str ,'CBSACODE':str})



In [243]:

df2['COUNTY_FULL_FIPS']=df2['STATEFIPS']+df2['COUNTYFIPS']

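# sum only the numeric columns (the FIPS columns are strings)
irs_2=df2.groupby('COUNTY_FULL_FIPS')[['N00200','A00200','A00100','N1','N2']].sum().reset_index()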

irs_2['average_AGI_c']=irs_2.A00100/irs_2.N1
irs_2['average_wage_c']=irs_2.A00200  /irs_2.N00200

irs_2=irs_2[['COUNTY_FULL_FIPS','average_AGI_c','average_wage_c']]
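
As a sanity check of the ratio logic, here is a small sketch with invented figures for two hypothetical counties. As far as I understand the IRS documentation, the amount columns (`A00100`, `A00200`) are expressed in thousands of dollars, so the averages below would be in thousands of dollars per return; treat that unit as an assumption to verify.

```python
import pandas as pd

# Invented figures for two hypothetical counties.
toy_irs = pd.DataFrame({
    'STATEFIPS':  ['01', '01'],
    'COUNTYFIPS': ['001', '003'],
    'N1':     [1000, 2000],      # number of returns
    'A00100': [50000, 150000],   # adjusted gross income (amount)
    'N00200': [800, 1500],       # number of returns with salaries and wages
    'A00200': [32000, 90000],    # salaries and wages (amount)
})

# Full county FIPS code = 2-digit state code + 3-digit county code.
toy_irs['COUNTY_FULL_FIPS'] = toy_irs['STATEFIPS'] + toy_irs['COUNTYFIPS']
toy_irs['average_AGI_c'] = toy_irs['A00100'] / toy_irs['N1']
toy_irs['average_wage_c'] = toy_irs['A00200'] / toy_irs['N00200']

print(toy_irs[['COUNTY_FULL_FIPS', 'average_AGI_c', 'average_wage_c']])
```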

In [244]:
local_3=local_2.merge(irs_2, left_on='county',right_on='COUNTY_FULL_FIPS')

In [245]:
local_3.drop(columns=['COUNTY_FULL_FIPS'],inplace=True)

### 2.1.3 Addition of Census data at county level

In [246]:
dfc=pd.read_csv('cc-est2017-alldata.csv',encoding='latin-1', 
            usecols=['STATE','COUNTY','YEAR','AGEGRP','TOT_POP', 'BA_MALE', 'BA_FEMALE',  'H_MALE','H_FEMALE'],
            dtype= {'STATE':str,'COUNTY':str}   )


In [247]:
#Select only the rows for 2017 (YEAR code 10 is the July 1, 2017 estimate)
dfc=dfc[dfc['YEAR']==10]

In [248]:
#recreate the county
dfc['COUNTY_ID']=dfc['STATE']+dfc['COUNTY']


In [249]:
dfc['BA']=dfc['BA_MALE']+dfc['BA_FEMALE']
dfc['H']=dfc['H_MALE']+dfc['H_FEMALE']


In [250]:



# over_65 holds the population of the 65+ age groups (AGEGRP 14 to 18), 0 otherwise
dfc['over_65']=0
dfc.loc[dfc['AGEGRP']>13,'over_65']=dfc['TOT_POP']

# per county: the all-ages row (AGEGRP 0) carries the maximum of each count,
# while the 65+ population is the sum over the age groups 14 to 18
dfc=dfc.groupby('COUNTY_ID').agg(
                pop_tot=pd.NamedAgg(column='TOT_POP', aggfunc='max'),
                pop_ba=pd.NamedAgg(column='BA', aggfunc='max'),
                pop_h=pd.NamedAgg(column='H', aggfunc='max'),
                pop_over_65=pd.NamedAgg(column='over_65', aggfunc='sum'))
   

dfc=dfc.reset_index()

dfc['per_over_65']=dfc.pop_over_65/dfc.pop_tot
dfc['per_ba']=dfc.pop_ba/dfc.pop_tot
dfc['per_h']=dfc.pop_h/dfc.pop_tot

dfc=dfc[['COUNTY_ID','per_over_65','per_ba','per_h']]
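
The age-group logic can be checked on a tiny example. This is only a sketch with invented rows for one hypothetical county; it assumes, as in the cell above, that `AGEGRP` 0 is the all-ages total and that groups 14 to 18 cover ages 65 and over.

```python
import pandas as pd

# Invented rows for one hypothetical county; only a few age groups are shown.
toy_census = pd.DataFrame({
    'COUNTY_ID': ['99999'] * 4,
    'AGEGRP':    [0, 13, 14, 18],
    'TOT_POP':   [1000, 60, 50, 10],
})

# Copy the population into over_65 only for the 65+ groups (AGEGRP 14 to 18).
toy_census['over_65'] = 0
mask = toy_census['AGEGRP'] > 13
toy_census.loc[mask, 'over_65'] = toy_census.loc[mask, 'TOT_POP']

agg = toy_census.groupby('COUNTY_ID').agg(
    pop_tot=('TOT_POP', 'max'),      # the AGEGRP 0 row holds the largest value
    pop_over_65=('over_65', 'sum'),  # sum of the 65+ groups only
).reset_index()
agg['per_over_65'] = agg['pop_over_65'] / agg['pop_tot']

print(agg)
```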

In [251]:
local = local_3.merge(dfc,left_on='county',right_on='COUNTY_ID' )

In [253]:
local.head()

Unnamed: 0,zipcode2017,hrrnum,county,City,State,Latitude,Longitude,average_AGI_c,average_wage_c,COUNTY_ID,per_over_65,per_ba,per_h
0,501,301,36103,Holtsville,NY,40.922326,-72.637078,89.476674,73.885973,36103,0.163743,0.085894,0.195437
1,6390,111,36103,Fishers Island,NY,41.261936,-72.00708,89.476674,73.885973,36103,0.163743,0.085894,0.195437
2,11702,301,36103,Babylon,NY,40.687649,-73.32549,89.476674,73.885973,36103,0.163743,0.085894,0.195437
3,11703,301,36103,North Babylon,NY,40.733398,-73.32257,89.476674,73.885973,36103,0.163743,0.085894,0.195437
4,11704,301,36103,West Babylon,NY,40.719249,-73.35829,89.476674,73.885973,36103,0.163743,0.085894,0.195437


In [254]:
local[local.zipcode2017.isin(missing_zipcode)]

Unnamed: 0,zipcode2017,hrrnum,county,City,State,Latitude,Longitude,average_AGI_c,average_wage_c,COUNTY_ID,per_over_65,per_ba,per_h
121,01040,230,25013,Holyoke,MA,42.201891,-72.624200,59.715230,49.727145,25013,0.164700,0.107965,0.252553
130,01085,230,25013,Westfield,MA,42.133642,-72.750290,59.715230,49.727145,25013,0.164700,0.107965,0.252553
137,01104,230,25013,Springfield,MA,42.130343,-72.573380,59.715230,49.727145,25013,0.164700,0.107965,0.252553
155,01199,230,25013,Springfield,MA,42.119943,-72.604983,59.715230,49.727145,25013,0.164700,0.107965,0.252553
171,01060,230,25015,Northampton,MA,42.324539,-72.635610,71.770153,59.584764,25015,0.165608,0.033658,0.056453
...,...,...,...,...,...,...,...,...,...,...,...,...,...
2923,08210,283,34009,Cape May Court House,NJ,39.081754,-74.836580,64.498280,48.672040,34009,0.256336,0.049630,0.078298
2964,08534,356,34021,Pennington,NJ,40.323150,-74.783640,97.825045,80.590870,34021,0.147694,0.213870,0.177895
2985,08629,356,34021,Trenton,NJ,40.219358,-74.733340,97.825045,80.590870,34021,0.147694,0.213870,0.177895
2986,08638,356,34021,Trenton,NJ,40.249908,-74.759530,97.825045,80.590870,34021,0.147694,0.213870,0.177895


In [126]:

# create the output folder if it does not exist yet
os.makedirs('../prepared_data', exist_ok=True)

In [267]:
local.to_csv(path_or_buf='../prepared_data/Output_local_basis.csv',index=False)

In [129]:
%%bash
git add ../prepared_data/Output_local_basis.csv

In [271]:
%%bash
git status

On branch master
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
	modified:   ../Visualizations.ipynb
	modified:   Data wrangling.ipynb

Untracked files:
  (use "git add <file>..." to include in what will be committed)
	../.DS_Store
	../.ipynb_checkpoints/
	.DS_Store
	.ipynb_checkpoints/
	2019_Gaz_zcta_national.txt

no changes added to commit (use "git add" and/or "git commit -a")


## 2.2 Preparation of the hospital dataset and the DRG dictionaries



**1. The inpatient dataset**

- 196325 observations covering 7 million discharges


- 3182 providers, i.e. hospitals (3181 after filtering out the DRGs found in fewer than 8 different hospitals)  
301 have 5 observations or fewer,   
552 have 10 observations or fewer


- 563 DRGs (440 after filtering)   
106 have 5 observations or fewer  
149 have 10 observations or fewer  
256 have 50 observations or fewer
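
The filter mentioned above (dropping the DRGs found in too few hospitals) can be sketched as follows on a toy table. The function name `filter_rare_drg` is hypothetical, the column names follow the renamed `inpatient` columns used later in this notebook, and the threshold is set to 2 only so that the tiny example has something to drop.

```python
import pandas as pd

# Toy stand-in for the inpatient table: one row per (DRG, hospital) pair.
toy_inpatient = pd.DataFrame({
    'DRG':     ['A', 'A', 'A', 'B', 'C', 'C'],
    'prov_id': ['h1', 'h2', 'h3', 'h1', 'h1', 'h2'],
})

def filter_rare_drg(df, min_hospitals):
    """Keep only the DRGs reported by at least `min_hospitals` distinct hospitals."""
    n_hosp = df.groupby('DRG')['prov_id'].transform('nunique')
    return df[n_hosp >= min_hospitals]

filtered = filter_rare_drg(toy_inpatient, min_hospitals=2)
print(sorted(filtered['DRG'].unique()))

# Observations per hospital, the basis for the "N observations or fewer" counts.
obs_per_hosp = toy_inpatient['prov_id'].value_counts()
print((obs_per_hosp <= 1).sum())
```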

To explain: the selection effect, i.e. the fact that some observations are missing. Find some example procedures, e.g. the most common ones.

My idea: plot the number of discharges on the y axis against the overall number of discharges per hospital.

Document all potential discrepancies.


In [None]:
#Build the number of discharges per hospital, with the % of DRGs that are described on the y axis

In [98]:
inpatient=pd.read_csv('MEDICARE_PROVIDER_CHARGE_INPATIENT_DRGALL_FY2017.CSV',
                      dtype={'Provider Id':str,'Provider Zip Code':str },
                     usecols=['DRG Definition', 'Provider Id','Provider State', 'Provider Zip Code',
       'Total Discharges', 'Average Covered Charges', 'Average Total Payments',
       'Average Medicare Payments'])

inpatient.columns=['DRG', 'prov_id','Provider State','prov_zip',
       'total_discharges', 'average_covered_charges', 'average_total_payments',
       'average_medicare_payments']

In [53]:
inpatient.prov_id.apply(lambda x : len(x))

0         5
1         5
2         5
3         5
4         5
         ..
196320    6
196321    6
196322    6
196323    6
196324    6
Name: prov_id, Length: 196325, dtype: int64

In [54]:
#Treat the column prov_id: prepend a zero to the 5-character IDs so that all IDs have 6 characters
inpatient['prov_id']=inpatient['prov_id'].apply(lambda x : '0'+x if len(x)==5 else x)
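
A vectorised alternative, shown on made-up identifiers: `Series.str.zfill(6)` pads every string to 6 characters. It matches the lambda above for IDs of length 5 or 6; unlike the lambda, it would also pad shorter IDs if any existed.

```python
import pandas as pd

ids = pd.Series(['10001', '670128'])  # made-up provider IDs of length 5 and 6

# Approach used in the notebook: prepend one zero to 5-character IDs.
padded_lambda = ids.apply(lambda x: '0' + x if len(x) == 5 else x)

# Vectorised equivalent for these inputs.
padded_zfill = ids.str.zfill(6)

print(padded_zfill.tolist())
```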


**2. The hospital dataset**

Merge of the hospital general information and the average spending per medicare patient dataset.   
Select only Acute Care Hospital.   
Filter out the territories outside the 50 states: 'AS', 'GU', 'MP', 'PR', 'VI'.


In [169]:
hosp=pd.read_csv('Hospital_General_Information.csv' ,
                dtype={'Facility ID':str,'ZIP Code' : str}
                 ,
                 usecols=['Facility ID','Facility Name','ZIP Code', 'State',
                          'Hospital Type',  'Hospital Ownership','Emergency Services', 
                          'Hospital overall rating', 'Hospital overall rating footnote'] 
                )


In [170]:
hosp=hosp[~hosp.State.isin(['AS', 'GU','MP','PR','VI'])]

In [171]:
hosp.columns=['Facility ID','hosp_name','state','zipcode','hosp_type','hosp_ownership',
       'hosp_emergency_services',
       'hosp_rating', 'hosp_rating_fn']

In [172]:
med_spending=pd.read_csv('Medicare_Hospital_spending_per_patient_Hospital.csv'
                         , 
                         usecols=['Facility ID','Score','Footnote'], 
                         dtype={'Facility ID':str}
                        )
med_spending.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4712 entries, 0 to 4711
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Facility ID  4712 non-null   object 
 1   Score        4712 non-null   object 
 2   Footnote     1651 non-null   float64
dtypes: float64(1), object(2)
memory usage: 110.6+ KB


In [173]:
hosp_full=med_spending.merge(hosp,how='inner', on = 'Facility ID')

In [176]:
hosp_full.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3205 entries, 0 to 4652
Data columns (total 11 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Facility ID              3205 non-null   object 
 1   Score                    3205 non-null   object 
 2   Footnote                 144 non-null    float64
 3   hosp_name                3205 non-null   object 
 4   state                    3205 non-null   object 
 5   zipcode                  3205 non-null   object 
 6   hosp_type                3205 non-null   object 
 7   hosp_ownership           3205 non-null   object 
 8   hosp_emergency_services  3205 non-null   object 
 9   hosp_rating              3205 non-null   object 
 10  hosp_rating_fn           331 non-null    float64
dtypes: float64(2), object(9)
memory usage: 300.5+ KB


In [177]:
hosp_full=hosp_full[hosp_full.hosp_type=='Acute Care Hospitals']

In [178]:
hosp_full.shape

(3205, 11)

From the 5320 initial hospitals in the Hospital General Information dataset, only 4712 remain. The 'deleted hospitals' are only Psychiatric Hospitals and Acute Care - Department of Defense hospitals.

Then we select only the Acute Care Hospitals: only 3263 remain. 

Let's check the numbers by category for the model

In [91]:
data_shema=hosp_full[['Facility ID','Footnote','prov_id']].drop_duplicates()
data_shema[data_shema['prov_id'].isna() & data_shema.Footnote.notna()].shape[0]
data_shema[data_shema['prov_id'].isna() & data_shema.Footnote.isna()].shape[0]
data_shema[data_shema['prov_id'].notna() & data_shema.Footnote.notna()].shape[0]
data_shema[data_shema['prov_id'].notna() & data_shema.Footnote.isna()].shape[0]
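
The counts by category above can also be obtained in a single table with `pd.crosstab`. A minimal sketch on invented data, reusing the column names `prov_id` and `Footnote`:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'prov_id':  ['a', None, 'c', None, 'e'],
    'Footnote': [np.nan, 1.0, 5.0, np.nan, np.nan],
})

# Rows: is prov_id missing? Columns: is Footnote missing?
counts = pd.crosstab(toy['prov_id'].isna(), toy['Footnote'].isna(),
                     rownames=['prov_id missing'],
                     colnames=['Footnote missing'])
print(counts)
```

Each cell of the table is one of the four combinations, so nothing is counted twice or forgotten.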

Check that we have local information for every hospital in the full dataset. 

Almost 200 of the missing matches come from zip codes starting with a zero...
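
Why zip codes starting with a zero get lost: if a CSV column is parsed as an integer, `01040` becomes `1040` and no longer matches a string key. A small sketch with an in-memory CSV (the content is made up, apart from reusing the zip code 01040 mentioned earlier):

```python
import io
import pandas as pd

csv_text = "zipcode2017,hrrnum\n01040,230\n11702,301\n"

# Default parsing infers an integer column and drops the leading zero.
as_int = pd.read_csv(io.StringIO(csv_text))

# Reading the column as a string preserves the zip code exactly as written.
as_str = pd.read_csv(io.StringIO(csv_text), dtype={'zipcode2017': str})

# Repairing an integer column after the fact.
repaired = as_int['zipcode2017'].astype(str).str.zfill(5)

print(as_int['zipcode2017'].tolist(), as_str['zipcode2017'].tolist())
```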

In [179]:
local=pd.read_csv('../prepared_data/Output_local_basis.csv',dtype={'zipcode2017':str})

In [263]:
full_local_hosp=local.merge(hosp_full,how='outer', left_on='zipcode2017',right_on='zipcode')

In [264]:
missing_zipcode=full_local_hosp[full_local_hosp['Facility ID'].notna() & full_local_hosp.zipcode2017.isna()].sort_values(by=['zipcode']).zipcode
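
Another way to audit an outer merge, sketched on toy data: `merge(..., indicator=True)` adds a `_merge` column that says whether each row comes from the left table only, the right table only, or both. The hospital IDs and zip codes below are invented.

```python
import pandas as pd

toy_local = pd.DataFrame({'zipcode2017': ['00001', '00002']})
toy_hosp = pd.DataFrame({'Facility ID': ['h1', 'h2'],
                         'zipcode': ['00002', '00003']})

merged = toy_local.merge(toy_hosp, how='outer',
                         left_on='zipcode2017', right_on='zipcode',
                         indicator=True)

# Hospitals whose zip code has no counterpart in the local table.
unmatched = merged.loc[merged['_merge'] == 'right_only', 'zipcode']
print(unmatched.tolist())
```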

Around 2315 observations in the inpatient dataset, corresponding to 103 hospitals, could not be matched to a provider id in the hosp dataset. It's just 1.1% of the total information. This is due to some mergers and acquisitions which made some hospitals 'disappear'.  

Conversely, 1354 Critical Access Hospitals, 184 Acute Care Hospitals and 95 children's hospitals don't have records in the inpatient dataset. 

The merged dataset provides records on 3079 Acute Care Hospitals. 59 hospitals don't have observations. 

Among the 

In [266]:
len(missing_zipcode)

37

In [258]:
local.zipcode2017.apply(lambda x : len(x)).value_counts()

5    38883
Name: zipcode2017, dtype: int64

In [259]:
local.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 38883 entries, 0 to 38882
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   zipcode2017     38883 non-null  object 
 1   hrrnum          38883 non-null  int64  
 2   county          38883 non-null  object 
 3   City            38883 non-null  object 
 4   State           38883 non-null  object 
 5   Latitude        38883 non-null  float64
 6   Longitude       38883 non-null  float64
 7   average_AGI_c   38883 non-null  float64
 8   average_wage_c  38883 non-null  float64
 9   COUNTY_ID       38883 non-null  object 
 10  per_over_65     38883 non-null  float64
 11  per_ba          38883 non-null  float64
 12  per_h           38883 non-null  float64
dtypes: float64(7), int64(1), object(5)
memory usage: 4.2+ MB


In [260]:
hosp_full.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3205 entries, 0 to 4652
Data columns (total 11 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Facility ID              3205 non-null   object 
 1   Score                    3205 non-null   object 
 2   Footnote                 144 non-null    float64
 3   hosp_name                3205 non-null   object 
 4   state                    3205 non-null   object 
 5   zipcode                  3205 non-null   object 
 6   hosp_type                3205 non-null   object 
 7   hosp_ownership           3205 non-null   object 
 8   hosp_emergency_services  3205 non-null   object 
 9   hosp_rating              3205 non-null   object 
 10  hosp_rating_fn           331 non-null    float64
dtypes: float64(2), object(9)
memory usage: 300.5+ KB


In [261]:
full_local_hosp.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 39210 entries, 0 to 39209
Data columns (total 24 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   zipcode2017              39173 non-null  object 
 1   hrrnum                   39173 non-null  float64
 2   county                   39173 non-null  object 
 3   City                     39173 non-null  object 
 4   State                    39173 non-null  object 
 5   Latitude                 39173 non-null  float64
 6   Longitude                39173 non-null  float64
 7   average_AGI_c            39173 non-null  float64
 8   average_wage_c           39173 non-null  float64
 9   COUNTY_ID                39173 non-null  object 
 10  per_over_65              39173 non-null  float64
 11  per_ba                   39173 non-null  float64
 12  per_h                    39173 non-null  float64
 13  Facility ID              3205 non-null   object 
 14  Score                 

When merging the hospital database and the local database 289 hospitals are unmatched. Several are in some unincorporated US territories. But most of them  

In [None]:




import dill

#Filter the DRGs with the list built in the "Work on my DRG list" notebook
with open('healthcare_simulator/app/list_DRG.pkd', 'rb') as f:
    list_DRG=dill.load(f)
inpatient=inpatient[inpatient['DRG'].isin(list_DRG)]

#Treat the column prov_id: prepend a zero to the 5-character IDs
temp=inpatient[inpatient['Provider State']=='CA'][['prov_id','Provider State']]
inpatient['prov_id']=inpatient['prov_id'].apply(lambda x : '0'+x if len(x)==5 else x)


**2. The hospital reference dataset and the medicare average spending dataset**


- The hospital general information dataset   
4812 hospitals   
Acute Care Hospitals         3369   
Critical Access Hospitals    1344   
Childrens                      99    


- The medicare average spending comparison dataset (Measure 2018)  
3143 individual observations   

- Merge  
Acute Care Hospitals    3070 non null

temp2=pd.read_csv('Data/Hospital General Information.csv',encoding = "ISO-8859-1")
temp3=temp2[temp2['State']=='CA'][['Provider ID','Hospital Name']].sort_values(by=['Provider ID']).head(50)

temp2.info()

hosp=pd.read_csv('Data/Hospital General Information.csv',encoding = "ISO-8859-1"
                ,dtype={'Provider ID':str,'ZIP Code' : str},usecols=['Provider ID','Hospital Name', 'State','ZIP Code',
       'Hospital Type',  'Hospital Ownership',
       'Emergency Services', 'Meets criteria for meaningful use of EHRs',
       'Hospital overall rating', 'Hospital overall rating footnote',
       'Mortality national comparison',
       'Mortality national comparison footnote',
       'Safety of care national comparison',
       'Safety of care national comparison footnote',
       'Readmission national comparison',
       'Readmission national comparison footnote',
       'Patient experience national comparison',
       'Patient experience national comparison footnote',
       'Effectiveness of care national comparison',
       'Effectiveness of care national comparison footnote',
       'Timeliness of care national comparison',
       'Timeliness of care national comparison footnote',
       'Efficient use of medical imaging national comparison',
       'Efficient use of medical imaging national comparison footnote'] )




hosp.columns=['Provider ID','hosp_name','State', 'zipcode','hosp_type','hosp_ownership',
       'hosp_emergency_services', 'hosp_meaningful_use_EHRs',
       'hosp_rating', 'hosp_rating_fn',
       'hosp_mortality_natcomp',
       'hosp_mortality_natcomp_fn',
       'hosp_safety_natcomp',
       'hosp_safety_natcomp_fn',
       'hosp_readmission_natcomp',
       'hosp_readmission_natcomp_fn',
       'hosp_patientexperience_natcomp',
       'hosp_patientexperience_natcomp_fn',
       'hosp_effectiveness_natcomp',
       'hosp_effectiveness_natcomp_fn',
       'hosp_timeliness_natcomp',
       'hosp_timeliness_natcomp_fn',
       'hosp_efficientmedicalimaging_natcomp',
       'hosp_efficientmedicalimaging_natcomp_fn']

hosp['hosp_rating'].value_counts()

#Select only acute care hospitals
hosp=hosp[hosp['hosp_type']=='Acute Care Hospitals']

temp4=pd.read_csv('Data/hosp_medicare_paiment_comparision.csv',  dtype={'Facility ID':str})
temp4

med_spending=pd.read_csv('Data/hosp_medicare_paiment_comparision.csv', 
                         usecols=['Facility ID','Value','Footnote'], dtype={'Facility ID':str})
med_spending.info()

new_hosp=hosp.merge(med_spending, how='left', left_on='Provider ID', right_on='Facility ID')

new_hosp[new_hosp['Value'].notna()]['hosp_type'].value_counts()

new_hosp['Provider ID'].value_counts()



In [24]:
%%bash 
ls -R .. 


Visualizations.ipynb
prepared_data
raw_inputs

../prepared_data:
Output_local_basis.csv

../raw_inputs:
17incyallnoagi.csv
Data wrangling.ipynb
Hospital_General_Information.csv
MEDICARE_PROVIDER_CHARGE_INPATIENT_DRGALL_FY2017.CSV
Medicare_Hospital_spending_per_patient_Hospital.csv
ZIP_COUNTY_122017.xlsx
ZipHsaHrr17.xls
cc-est2017-alldata.csv
us-zip-code-latitude-and-longitude.csv


In [50]:
%%bash
ls

17incyallnoagi.csv
Data wrangling.ipynb
Hospital_General_Information.csv
MEDICARE_PROVIDER_CHARGE_INPATIENT_DRGALL_FY2017.CSV
Medicare_Hospital_spending_per_patient_Hospital.csv
ZIP_COUNTY_122017.xlsx
ZipHsaHrr17.xls
index.html?format=csv
us-zip-code-latitude-and-longitude.csv


In [92]:
local_3

Unnamed: 0,zipcode2017,hrrnum,county,City,State,Latitude,Longitude,average_AGI_c,average_wage_c
0,10001,303,36061,New York,NY,40.750742,-73.99653,226.146157,146.194817
1,10002,303,36061,New York,NY,40.717040,-73.98700,226.146157,146.194817
2,10003,303,36061,New York,NY,40.732509,-73.98935,226.146157,146.194817
3,10004,303,36061,New York,NJ,40.699226,-74.04118,226.146157,146.194817
4,10005,303,36061,New York,NY,40.706019,-74.00858,226.146157,146.194817
...,...,...,...,...,...,...,...,...,...
35994,99922,10,02198,Hydaburg,AK,55.209339,-132.82545,52.254701,44.950000
35995,99923,10,02198,Hyder,AK,55.941442,-130.05450,52.254701,44.950000
35996,99925,10,02198,Klawock,AK,55.555164,-133.07316,52.254701,44.950000
35997,99926,10,02198,Metlakatla,AK,55.123897,-131.56883,52.254701,44.950000


In [115]:
%%bash
git status

On branch master
Untracked files:
  (use "git add <file>..." to include in what will be committed)
	../.DS_Store
	.DS_Store
	.ipynb_checkpoints/
	17incyallnoagi.csv
	Data wrangling.ipynb
	ZIP_COUNTY_122017.xlsx
	ZipHsaHrr17.xls
	cc-est2017-alldata.csv
	us-zip-code-latitude-and-longitude.csv

nothing added to commit but untracked files present (use "git add" to track)


In [120]:
%%bash
git ls-tree -r master --name-only


17incyallnoagi.csv
Data wrangling.ipynb
Hospital_General_Information.csv
MEDICARE_PROVIDER_CHARGE_INPATIENT_DRGALL_FY2017.CSV
Medicare_Hospital_spending_per_patient_Hospital.csv
ZIP_COUNTY_122017.xlsx
ZipHsaHrr17.xls
cc-est2017-alldata.csv
us-zip-code-latitude-and-longitude.csv
