
<center> <h1>Read HCUP data </h1> </center>

<img src="https://github.com/Azure/cortana-intelligence-population-health-management/blob/master/ManualDeploymentGuide/media/hcuplogo3.PNG?raw=true" width="700" height="500" />
<img src="https://github.com/Azure/cortana-intelligence-population-health-management/blob/master/ManualDeploymentGuide/media/hcuplogo2.PNG?raw=true" width="500" height="500" />

The Healthcare Cost and Utilization Project ([HCUP](https://www.hcup-us.ahrq.gov/)) includes the largest collection of longitudinal hospital care data in the United States. The data can be purchased through the [HCUP Central Distributor](https://www.hcup-us.ahrq.gov/tech_assist/centdist.jsp).  
In this notebook we show how to read the HCUP State Inpatient Databases ([SID](https://www.hcup-us.ahrq.gov/sidoverview.jsp)) files for analysis and modelling.  
The HCUP SID data comes as fixed-width ASCII files; samples are available [here](https://github.com/Azure/cortana-intelligence-population-health-management/tree/master/ManualDeploymentGuide/Model/SampleHCUPdata). The layout of each file (variable names and column positions) is described in the following file specifications:  
https://www.hcup-us.ahrq.gov/db/state/sidc/tools/filespecs/WA_SID_2011_CORE.loc  
https://www.hcup-us.ahrq.gov/db/state/sidc/tools/filespecs/WA_SID_2011_CHGS.loc  
https://www.hcup-us.ahrq.gov/db/state/sidc/tools/filespecs/WA_SID_2011_SEVERITY.loc  
https://www.hcup-us.ahrq.gov/db/state/sidc/tools/filespecs/WA_SID_2011_DX_PR_GRPS.loc   
We will use these description files together with the sample ASCII data files to create CSV files with headers that can be used for analysis.
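
Before working with the real files, the idea can be tried on a toy example. The sketch below is only an illustration: the three-variable layout and the two records are invented, and `layout`, `asc_file` and `csv_file` are hypothetical names that do not appear in the HCUP specifications.

```r
# Toy layout (invented for illustration; not an actual HCUP specification)
layout <- data.frame(
  name  = c("KEY", "AGE", "DIED"),
  start = c(1, 6, 9),
  end   = c(5, 8, 9),
  stringsAsFactors = FALSE
)

# Two fixed-width records that follow the toy layout
asc_file <- tempfile(fileext = ".asc")
writeLines(c("10001 520",
             "10002 871"), asc_file)

# Field width = end - start + 1 when the fields are contiguous
widths <- layout$end - layout$start + 1

# Read the fixed-width file and attach the variable names as headers
dat <- read.fwf(asc_file, widths = widths, col.names = layout$name)

# Write the result as a CSV file with a header row
csv_file <- tempfile(fileext = ".csv")
write.csv(dat, csv_file, row.names = FALSE)
dat
```

The cells that follow apply the same steps, except that the layout is read from the `.loc` description files instead of being typed in by hand.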



In [None]:
# clear workspace
rm(list=ls())
gc()

In [33]:
file = 'core_desc.txt'
download.file('https://www.hcup-us.ahrq.gov/db/state/sidc/tools/filespecs/WA_SID_2011_CORE.loc',file)

str_start = c(1,5,10,27,31,58,63,68,70,75)
str_end   = c(3,8,25,29,56,61,66,68,73,174)

# +2 (not +1) so that each field also takes in the blank column that follows it
widths = (str_end - str_start)+2

des = read.fwf(file,skip=20, widths= widths)

des$V10 = gsub(',','_',des$V10)
colnameslong = data.frame(core_2011 = des$V10)
colnames = data.frame(core_2011 = des$V5)

# Now use the description file (core_desc.txt) to get str_start and str_end for reading the actual data
str_start = des$V6
str_end = des$V7

widths = (str_end - str_start)+1

# Sample file: https://github.com/Azure/cortana-intelligence-population-health-management/raw/master/ManualDeploymentGuide/Model/SampleHCUPdata/Sample_WA_SID_2011_CORE.asc
#download.file('https://github.com/Azure/cortana-intelligence-population-health-management/raw/master/ManualDeploymentGuide/Model/SampleHCUPdata/Sample_WA_SID_2011_CORE.asc','Sample_WA_SID_2011_CORE.asc')
#system('wget https://github.com/Azure/cortana-intelligence-population-health-management/raw/master/ManualDeploymentGuide/Model/SampleHCUPdata/Sample_WA_SID_2011_CORE.asc')
pathd = 'C:/dsvm/notebooks/HealthcareSolution/2011/SampleHCUPdata'
dat_core = read.fwf(file.path(pathd,'Sample_WA_SID_2011_CORE.asc'),widths= widths)
dim(dat_core)
# trim the padding that read.fwf keeps around the variable names
names(dat_core) = trimws(colnames$core_2011)
head(dat_core)
# write.csv(dat_core, 'Sample_WA_SID_2011_CORE.csv', row.names = F )

,AGE,AGEDAY,AGEMONTH,AHOUR,AMONTH,ATYPE,AWEEKEND,DHOUR,DIED,DISPUB04,...,TOTCHG,TOTCHG_X,TRAN_IN,TRAN_OUT,VisitLink,YEAR,ZIP3,ZIPINC_QRTL,ZIP,AYEAR
1,52,-99,-99,700,1,3,0,1400,0,1,...,56511,56510.96,0,0,36389,2011,981,2,98122,2011
2,65,-99,-99,1000,1,3,0,1400,0,6,...,140956,140956.5,0,0,36390,2011,981,3,98144,2011
3,87,-99,-99,700,1,3,0,1100,0,1,...,12687,12687.35,0,0,36391,2011,981,3,98109,2011
4,23,-99,-99,500,1,3,0,1400,0,1,...,23402,23402.2,0,0,36392,2011,980,4,98033,2011
5,30,-99,-99,2200,11,3,0,1300,0,1,...,240352,240351.7,0,0,36393,2011,982,4,98208,2010
6,29,-99,-99,1200,11,3,0,600,0,5,...,282202,282202.2,0,1,36394,2011,982,1,98225,2010
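
The `+2` used when reading the description file can look like an off-by-one error next to the `+1` used for the data file. A short check, reusing the `str_start` and `str_end` values from the cell above, suggests it is deliberate: each field is widened by one blank separator column so that the widths tile the line without gaps. This reading is inferred from the numbers themselves, not from the HCUP documentation; `next_start` and `tiles` are names introduced here.

```r
# Column positions used above for the .loc description file
str_start <- c(1, 5, 10, 27, 31, 58, 63, 68, 70, 75)
str_end   <- c(3, 8, 25, 29, 56, 61, 66, 68, 73, 174)

# Width of each field plus the single blank column after it
widths <- (str_end - str_start) + 2

# Column at which the next field would begin if the widths are laid end to end
next_start <- cumsum(widths) + 1

# TRUE when every field begins exactly where the previous one stops
tiles <- all(next_start[-length(next_start)] == str_start[-1])
tiles  # TRUE
```

For the data file the plain `+1` is used instead, which is consistent with the starting and ending columns listed in the description file being contiguous.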


In [43]:
file = 'charges_desc.txt'
download.file('https://www.hcup-us.ahrq.gov/db/state/sidc/tools/filespecs/WA_SID_2011_CHGS.loc',file)

str_start = c(1,5,10,27,31,58,63,68,70,75)
str_end   = c(3,8,25,29,56,61,66,68,73,174)

# +2 (not +1) so that each field also takes in the blank column that follows it
widths = (str_end - str_start)+2

des = read.fwf(file,skip=20, widths= widths)

des$V10 = gsub(',','_',des$V10)
colnameslong = data.frame(charges_2011 = des$V10)
colnames = data.frame(charges_2011 = des$V5)

# Now use the description file (charges_desc.txt) to get str_start and str_end for reading the actual data
str_start = des$V6
str_end = des$V7

widths = (str_end - str_start)+1

# Sample file: https://github.com/Azure/cortana-intelligence-population-health-management/raw/master/ManualDeploymentGuide/Model/SampleHCUPdata/Sample_WA_SID_2011_CHGS.asc
#download.file('https://github.com/Azure/cortana-intelligence-population-health-management/raw/master/ManualDeploymentGuide/Model/SampleHCUPdata/Sample_WA_SID_2011_CHGS.asc','Sample_WA_SID_2011_CHGS.asc')
#system('wget https://github.com/Azure/cortana-intelligence-population-health-management/raw/master/ManualDeploymentGuide/Model/SampleHCUPdata/Sample_WA_SID_2011_CHGS.asc')
pathd = 'C:/dsvm/notebooks/HealthcareSolution/2011/SampleHCUPdata'
dat_chrg = read.fwf(file.path(pathd,'Sample_WA_SID_2011_CHGS.asc'),widths= widths)
dim(dat_chrg)
# trim the padding that read.fwf keeps around the variable names
names(dat_chrg) = trimws(colnames$charges_2011)
head(dat_chrg)
# write.csv(dat_chrg, 'Sample_WA_SID_2011_CHGS.csv', row.names = F )

,KEY,NREVCD,REVCD1,REVCD2,REVCD3,REVCD4,REVCD5,REVCD6,REVCD7,REVCD8,...,UNIT43,UNIT44,UNIT45,UNIT46,UNIT47,UNIT48,UNIT49,UNIT50,UNIT51,UNIT52
1,532011100000000.0,15,1,120,250,258,272,278,300,305,...,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999
2,532011100000000.0,18,1,120,250,258,270,272,278,300,...,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999
3,532011100000000.0,15,1,120,250,270,272,300,301,305,...,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999
4,532011100000000.0,11,1,122,250,270,272,300,305,360,...,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999
5,532011100000000.0,13,1,129,250,300,301,302,305,306,...,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999
6,532011100000000.0,12,1,129,250,300,301,302,305,306,...,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999,-9999999


In [34]:
file = 'severity_desc.txt'
download.file('https://www.hcup-us.ahrq.gov/db/state/sidc/tools/filespecs/WA_SID_2011_SEVERITY.loc',file)

str_start = c(1,5,10,27,31,58,63,68,70,75)
str_end   = c(3,8,25,29,56,61,66,68,73,174)

# +2 (not +1) so that each field also takes in the blank column that follows it
widths = (str_end - str_start)+2

des = read.fwf(file,skip=20, widths= widths)

des$V10 = gsub(',','_',des$V10)
colnameslong = data.frame(severity_2011 = des$V10)
colnames = data.frame(severity_2011 = des$V5)

# Now use the description file (severity_desc.txt) to get str_start and str_end for reading the actual data
str_start = des$V6
str_end = des$V7

widths = (str_end - str_start)+1

# Sample file: https://github.com/Azure/cortana-intelligence-population-health-management/raw/master/ManualDeploymentGuide/Model/SampleHCUPdata/Sample_WA_SID_2011_SEVERITY.asc
#download.file('https://github.com/Azure/cortana-intelligence-population-health-management/raw/master/ManualDeploymentGuide/Model/SampleHCUPdata/Sample_WA_SID_2011_SEVERITY.asc','Sample_WA_SID_2011_SEVERITY.asc')
#system('wget https://github.com/Azure/cortana-intelligence-population-health-management/raw/master/ManualDeploymentGuide/Model/SampleHCUPdata/Sample_WA_SID_2011_SEVERITY.asc')
pathd = 'C:/dsvm/notebooks/HealthcareSolution/2011/SampleHCUPdata'
dat_sevr = read.fwf(file.path(pathd,'Sample_WA_SID_2011_SEVERITY.asc'),widths= widths)
dim(dat_sevr)
# trim the padding that read.fwf keeps around the variable names
names(dat_sevr) = trimws(colnames$severity_2011)
head(dat_sevr)
# write.csv(dat_sevr, 'Sample_WA_SID_2011_SEVERITY.csv', row.names = F )

,KEY,CM_AIDS,CM_ALCOHOL,CM_ANEMDEF,CM_ARTH,CM_BLDLOSS,CM_CHF,CM_CHRNLUNG,CM_COAG,CM_DEPRESS,...,CM_OBESE,CM_PARA,CM_PERIVASC,CM_PSYCH,CM_PULMCIRC,CM_RENLFAIL,CM_TUMOR,CM_ULCER,CM_VALVE,CM_WGHTLOSS
1,532011100000000.0,1,0,0,1,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
2,532011100000000.0,0,0,0,0,0,0,1,0,0,...,1,0,0,0,0,0,0,0,0,0
3,532011100000000.0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,532011100000000.0,0,0,1,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,532011100000000.0,0,1,1,0,1,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
6,532011100000000.0,0,0,0,0,0,0,0,0,1,...,0,0,0,1,0,0,0,0,0,0


In [35]:
file = 'dxpr_desc.txt'
download.file('https://www.hcup-us.ahrq.gov/db/state/sidc/tools/filespecs/WA_SID_2011_DX_PR_GRPS.loc',file)

str_start = c(1,5,10,27,31,58,63,68,70,75)
str_end   = c(3,8,25,29,56,61,66,68,73,174)

# +2 (not +1) so that each field also takes in the blank column that follows it
widths = (str_end - str_start)+2

des = read.fwf(file,skip=20, widths= widths)

des$V10 = gsub(',','_',des$V10)
colnameslong = data.frame(dxpr_2011 = des$V10)
colnames = data.frame(dxpr_2011 = des$V5)

# Now use the description file (dxpr_desc.txt) to get str_start and str_end for reading the actual data
str_start = des$V6
str_end = des$V7

widths = (str_end - str_start)+1

# Sample file: https://github.com/Azure/cortana-intelligence-population-health-management/raw/master/ManualDeploymentGuide/Model/SampleHCUPdata/Sample_WA_SID_2011_DX_PR_GRPS.asc
#download.file('https://github.com/Azure/cortana-intelligence-population-health-management/raw/master/ManualDeploymentGuide/Model/SampleHCUPdata/Sample_WA_SID_2011_DX_PR_GRPS.asc','Sample_WA_SID_2011_DX_PR_GRPS.asc')
#system('wget https://github.com/Azure/cortana-intelligence-population-health-management/raw/master/ManualDeploymentGuide/Model/SampleHCUPdata/Sample_WA_SID_2011_DX_PR_GRPS.asc')
pathd = 'C:/dsvm/notebooks/HealthcareSolution/2011/SampleHCUPdata'
dat_dxpr = read.fwf(file.path(pathd,'Sample_WA_SID_2011_DX_PR_GRPS.asc'),widths= widths)
dim(dat_dxpr)
# trim the padding that read.fwf keeps around the variable names
names(dat_dxpr) = trimws(colnames$dxpr_2011)
head(dat_dxpr)
# write.csv(dat_dxpr, 'Sample_WA_SID_2011_DX_PR_GRPS.csv', row.names = F )

,CHRON1,CHRON2,CHRON3,CHRON4,CHRON5,CHRON6,CHRON7,CHRON8,CHRON9,CHRON10,...,U_OCCTHERAPY,U_ORGANACQ,U_OTHIMPLANTS,U_PACEMAKER,U_PHYTHERAPY,U_RADTHERAPY,U_RESPTHERAPY,U_SPEECHTHERAPY,U_STRESS,U_ULTRASOUND
1,1,1,1,1,1,1,1,1,1,1,...,1,0,1,0,1,0,0,0,0,0
2,0,0,0,1,1,1,1,1,1,1,...,1,0,1,0,1,0,0,0,0,0
3,0,1,1,0,1,1,1,1,1,0,...,1,0,0,0,1,0,0,0,0,0
4,0,0,1,0,0,0,-9,-9,-9,-9,...,0,0,0,0,0,0,0,0,0,0
5,1,1,0,1,1,1,1,0,0,1,...,0,0,0,0,1,0,0,0,0,1
6,1,1,1,0,0,1,1,1,1,1,...,0,0,0,0,0,0,1,0,0,1
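
The four cells above repeat the same two steps with different file names: parse the `.loc` description, then read the matching `.asc` file. One way to factor this out is sketched below. The function names `read_loc` and `read_sid` are hypothetical, and the demonstration runs on small invented files written to a temporary directory, so the toy `.loc` lines only imitate the column positions used in this notebook and are not real HCUP content.

```r
# Hypothetical helper: parse a .loc-style description file into a layout table.
# The column positions are the ones used in the cells above.
read_loc <- function(loc_path, skip = 20) {
  str_start <- c(1, 5, 10, 27, 31, 58, 63, 68, 70, 75)
  str_end   <- c(3, 8, 25, 29, 56, 61, 66, 68, 73, 174)
  des <- read.fwf(loc_path, skip = skip,
                  widths = (str_end - str_start) + 2,
                  stringsAsFactors = FALSE)
  data.frame(name = trimws(des$V5), start = des$V6, end = des$V7,
             stringsAsFactors = FALSE)
}

# Hypothetical helper: read a fixed-width data file using that layout.
read_sid <- function(asc_path, layout) {
  dat <- read.fwf(asc_path, widths = layout$end - layout$start + 1)
  names(dat) <- layout$name
  dat
}

# Toy description file with two variables and no header lines (invented content)
loc_file <- tempfile(fileext = ".loc")
writeLines(sprintf("%-3s %-4s %-16s %3d %-26s %4d %4d %1s %-4s %s",
                   "XX", "2011", "CORE", 1:2, c("KEY", "AGE"),
                   c(1L, 6L), c(5L, 8L), "", "Num",
                   c("Record id", "Age in years")),
           loc_file)

# Toy data file matching that description
asc_file <- tempfile(fileext = ".asc")
writeLines(c("10001 52",
             "10002 87"), asc_file)

layout <- read_loc(loc_file, skip = 0)  # the real .loc files use skip = 20
dat <- read_sid(asc_file, layout)
dat
```

With the real files, each of the four cells would reduce to one `read_loc` call on the downloaded description file followed by one `read_sid` call on the corresponding sample file.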


In [None]:
dat = merge(merge(merge(dat_sevr, dat_chrg, by = "KEY"), dat_core, by = "KEY"), dat_dxpr, by ="KEY")  
dim(dat)
head(dat)
#write.csv(dat, 'Sample_WA_SID_2011.csv', row.names = F )
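
`merge()` performs an inner join by default, so a record whose `KEY` is missing from any one of the four tables is dropped silently, and a `KEY` that occurs more than once multiplies rows. The toy example below illustrates both effects with invented data frames `a` and `b`. It assumes that `KEY` is meant to identify a single record in each file, which is worth confirming on the real data before relying on the merged table.

```r
# Invented toy tables sharing a KEY column
a <- data.frame(KEY = c(1, 2, 3), CM_AIDS = c(0, 0, 1))
b <- data.frame(KEY = c(1, 2, 2), TOTCHG = c(100, 200, 250))

# Inner join: KEY 3 is absent from b and is dropped, KEY 2 matches twice
m <- merge(a, b, by = "KEY")
nrow(m)                     # 3

# Simple checks to run on each table before merging
anyDuplicated(b$KEY) > 0    # TRUE: b contains a repeated KEY
setdiff(a$KEY, b$KEY)       # 3: KEY present in a but not in b
```

Comparing `dim(dat)` with the `dim()` of each input table gives a quick indication of whether any rows were lost or duplicated in the merge.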