In this example (Example #1), we compare two predictive models on the same data. By default we use the prepackaged example, which predicts 30-day hospital readmissions from diabetes-related health data.

First, load the package. BTW, type ?healthcareai (any time after loading the package) to open the docs!

In [10]:
library(healthcareai)

Load data from SQL Server...

In [11]:
#Set up connection by specifying your server
connection.string = 'driver={SQL Server};
                     server=localhost;
                     trusted_connection=true'

#Specify the query and pull into the data frame
query = "SELECT
       [PatientEncounterID]
      ,[PatientID]
      ,[SystolicBPNBR]
      ,[LDLNBR]
      ,[A1CNBR]
      ,[GenderFLG]
      ,[ThirtyDayReadmitFLG]
      ,[InTestWindowFLG]
  FROM [SAM].[dbo].[DiabetesClinical]
  WHERE InTestWindowFLG = 'N'" # Only grab training set when developing/comparing the models

df <- selectData(connection.string, query)

...or load directly from a .csv file (uncomment the lines below to use this route).

In [12]:
# This line identifies our prepackaged sample data for loading. You can delete it if using your own data.
# csvfile <- system.file("extdata", "DiabetesClinicalOutpatient.csv", package = "healthcareai")

# df <- read.csv(file = csvfile, # <-- or path/to/yourfile.csv
#                header = TRUE,
#                na.strings = c('NULL', 'NA', ""))
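
To see what the na.strings argument does, here is a small self-contained sketch in base R. The inline CSV text and its values are invented for illustration (only the column names come from the query above), and the final subsetting step is an assumption about how you might mirror the SQL WHERE clause when working from a file.

```r
# Hypothetical inline CSV standing in for a file on disk.
csv_text <- "PatientEncounterID,A1CNBR,GenderFLG,InTestWindowFLG
1,4.2,M,N
2,NULL,F,N
3,6.6,,Y"

demo <- read.csv(text = csv_text,
                 header = TRUE,
                 na.strings = c('NULL', 'NA', ""),
                 stringsAsFactors = TRUE)

# 'NULL' and empty fields are read as NA, so A1CNBR stays numeric
# instead of being read as text.
print(is.na(demo$A1CNBR))
print(class(demo$A1CNBR))

# Keep only training rows, mirroring WHERE InTestWindowFLG = 'N'.
train <- demo[demo$InTestWindowFLG == 'N', ]
print(nrow(train))
```

Without na.strings, the literal text NULL would force the whole A1CNBR column to be read as character or factor.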

Check the column types of the data frame to make sure that, for example, factor columns aren't listed as numeric.

In [13]:
str(df)

'data.frame':	987 obs. of  8 variables:
 $ PatientEncounterID : int  1 2 3 4 5 6 7 8 9 10 ...
 $ PatientID          : int  10001 10001 10001 10002 10002 10002 10002 10003 10003 10003 ...
 $ SystolicBPNBR      : int  167 153 170 187 188 185 189 149 155 160 ...
 $ LDLNBR             : int  195 214 191 135 125 178 101 160 144 130 ...
 $ A1CNBR             : num  4.2 5 4 4.4 4.3 5 4 5 6.6 8 ...
 $ GenderFLG          : Factor w/ 2 levels "F","M": 2 2 2 2 2 2 2 2 2 2 ...
 $ ThirtyDayReadmitFLG: Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 2 ...
 $ InTestWindowFLG    : Factor w/ 1 level "N": 1 1 1 1 1 1 1 1 1 1 ...


Change a column type, if necessary.

In [14]:
df$GenderFLG <- as.factor(df$GenderFLG)
df$LDLNBR    <- as.numeric(df$LDLNBR) # only here for demonstration
str(df)

'data.frame':	987 obs. of  8 variables:
 $ PatientEncounterID : int  1 2 3 4 5 6 7 8 9 10 ...
 $ PatientID          : int  10001 10001 10001 10002 10002 10002 10002 10003 10003 10003 ...
 $ SystolicBPNBR      : int  167 153 170 187 188 185 189 149 155 160 ...
 $ LDLNBR             : num  195 214 191 135 125 178 101 160 144 130 ...
 $ A1CNBR             : num  4.2 5 4 4.4 4.3 5 4 5 6.6 8 ...
 $ GenderFLG          : Factor w/ 2 levels "F","M": 2 2 2 2 2 2 2 2 2 2 ...
 $ ThirtyDayReadmitFLG: Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 2 ...
 $ InTestWindowFLG    : Factor w/ 1 level "N": 1 1 1 1 1 1 1 1 1 1 ...
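
For wide tables, a per-column class summary can be easier to scan than str(). The sketch below uses a toy data frame with invented values; it also shows one conversion pitfall that is easy to hit when a number-like column has been read as a factor.

```r
# Toy data frame (values invented): a categorical flag read as text
# and a lab value read as integer.
demo <- data.frame(GenderFLG = c("M", "F", "M"),
                   LDLNBR = c(195L, 214L, 191L),
                   stringsAsFactors = FALSE)

# One class per column.
print(sapply(demo, class))

demo$GenderFLG <- as.factor(demo$GenderFLG)
demo$LDLNBR    <- as.numeric(demo$LDLNBR)
print(sapply(demo, class))

# Pitfall: as.numeric() on a factor returns the internal level codes,
# not the numbers printed as labels. Convert through character first.
f <- factor(c("10", "20", "10"))
codes  <- as.numeric(f)
values <- as.numeric(as.character(f))
print(codes)
print(values)
```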


Define model parameters

In [15]:
set.seed(42) # <-- used to make results reproducible
p <- SupervisedModelDevelopmentParams$new()
p$df = df
p$type = 'classification'
p$impute = TRUE
p$predictedCol = 'ThirtyDayReadmitFLG'
p$cores = 1
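
Setting p$impute = TRUE asks the package to fill in missing values before modeling. As a rough sketch of the idea only, the code below assumes a simple strategy (column mean for numeric columns, most frequent level for factors); the package's actual imputation may differ. The helper impute_column is hypothetical and not part of healthcareai, and the data values are invented.

```r
# Hypothetical helper: mean imputation for numeric columns,
# most-frequent-level imputation for factor columns.
impute_column <- function(x) {
  if (is.numeric(x)) {
    x[is.na(x)] <- mean(x, na.rm = TRUE)
  } else if (is.factor(x)) {
    counts <- table(x)  # table() ignores NA by default
    x[is.na(x)] <- names(counts)[which.max(counts)]
  }
  x
}

demo <- data.frame(A1CNBR = c(4.0, NA, 6.0),
                   GenderFLG = factor(c("M", NA, "M")))

# Apply column by column while keeping the data frame structure.
demo[] <- lapply(demo, impute_column)
print(demo)
```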

Now that we've arranged the data and turned on imputation, let's create a LASSO model and
1) See how accurate it is and
2) See which variables are important.

In [16]:
# Run Lasso
Lasso <- LassoDevelopment$new(p)
Lasso$run()

[1] "AUC: 0.74"
[1] "95% CI AUC: (0.62,0.85)"
[1] "Grouped Lasso coefficients:"
       (Intercept) PatientEncounterID          PatientID      SystolicBPNBR 
        -3.0323774          0.0000000          0.0000000          0.0000000 
            LDLNBR             A1CNBR         GenderFLGM 
         0.0000000          0.2114222          0.0000000 
[1] "Variables with non-zero coefficients: A1CNBR"


The AUC is around 0.74, which isn't bad to start with. Note that in this simple example, the lasso shrinks every coefficient except A1CNBR to zero, so the final model (if lasso is chosen) can leave the other features out of the query. (See Example 2 for more.)
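
As a rough illustration of what the single non-zero coefficient means, the sketch below plugs the printed intercept and A1CNBR coefficient into a logistic function. It assumes the coefficients are reported on the original A1CNBR scale and that the model uses a standard logit link; the output above does not confirm either point, so treat the numbers as illustrative.

```r
# Values copied from the lasso output above.
intercept <- -3.0323774
beta_a1c  <-  0.2114222

# Example A1C values (chosen for illustration).
a1c <- c(4, 6, 8)

# Predicted readmission probability under a logistic model.
prob <- plogis(intercept + beta_a1c * a1c)
print(round(prob, 3))

# Multiplicative change in the odds of readmission per one-unit rise in A1C.
odds_ratio <- exp(beta_a1c)
print(round(odds_ratio, 3))
```

A positive coefficient means the predicted probability rises with A1C, which is why prob increases across the three example values.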

Now let's see if we can improve on that by testing a random forest model.

In [17]:
# Run Random Forest
rf <- RandomForestDevelopment$new(p)
rf$run()

Loading required package: e1071
Loading required package: ranger


[1] "AUC: 0.99"
[1] "95% CI AUC: (0.98,1)"
ranger variable importance

                   Overall
PatientEncounterID  100.00
A1CNBR               91.98
PatientID            84.32
SystolicBPNBR        69.80
LDLNBR               41.73
GenderFLG             0.00


Oh, interesting: the AUC for random forest (0.99) is much higher than it was for the lasso model (0.74), and the two 95% confidence intervals don't overlap. Let's go to Example #2, so we can save and deploy the random forest model.
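
Since both models are compared by AUC, it may help to see what that number measures: the chance that a randomly chosen readmitted patient gets a higher score than a randomly chosen non-readmitted one. The base R sketch below computes it from ranks. The function auc_rank is a hypothetical helper, not the package's implementation, and the labels and scores are invented.

```r
# Hypothetical helper: AUC via the rank-sum (Mann-Whitney) formulation.
auc_rank <- function(labels, scores) {
  pos   <- labels == "Y"
  n_pos <- sum(pos)
  n_neg <- sum(!pos)
  r <- rank(scores)  # ties receive average ranks
  (sum(r[pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy example: 2 readmitted ("Y") and 3 non-readmitted ("N") encounters.
labels <- c("N", "N", "Y", "N", "Y")
scores <- c(0.10, 0.40, 0.35, 0.20, 0.80)

# 5 of the 6 positive/negative pairs are ordered correctly, so AUC = 5/6.
print(auc_rank(labels, scores))
```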

Reach out to Levi Thatcher (levi.thatcher@healthcatalyst.com) if you have any questions!