This package helps you compare and deploy models in two steps: 1) compare models built on most of your data (some rows are held out to check accuracy), and 2) pick the best approach, rebuild that model on all of your data, save it, and deploy predictions on test data to SQL Server.

To begin step #1, we make a connection and load in data from SQL Server.

In [30]:
from hcpytools.develop_supervised_model import DevelopSupervisedModel
from hcpytools.deploy_supervised_model import DeploySupervisedModel
import pandas as pd
import pyodbc
import random

cnxn = pyodbc.connect(SERVER='localhost',
                      DRIVER='{SQL Server Native Client 11.0}',
                      Trusted_Connection='yes')

df = pd.read_sql("""SELECT
                      [OrganizationLevel]
                      ,[MaritalStatus]
                      ,[Gender]
                      --Predicted col has to be Y/N
                      ,IIF([SalariedFlag]=1,'Y','N') as SalariedFlag 
                      ,[VacationHours]
                      ,[SickLeaveHours]
                    FROM [AdventureWorks2012].[HumanResources].[Employee]""",
                 cnxn)
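
The query above builds the Y/N labels in SQL with IIF. If a flag arrives in pandas as 0/1 instead, the same conversion can be done after loading. A minimal sketch on a small made-up frame (the `raw` frame and its values are hypothetical, not the AdventureWorks data):

```python
import pandas as pd

# Hypothetical stand-in for a query result where the flag arrives as 0/1.
raw = pd.DataFrame({
    "SalariedFlag": [1, 0, 1, 1],
    "VacationHours": [99, 48, 2, 5],
})

# Map the integer flag to the Y/N labels the predicted column needs.
raw["SalariedFlag"] = raw["SalariedFlag"].map({1: "Y", 0: "N"})

# Sanity check: nothing other than Y or N is left in the column.
assert set(raw["SalariedFlag"].unique()) <= {"Y", "N"}
```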

Let's see what we've loaded.

In [31]:
print(df.head())

   OrganizationLevel MaritalStatus Gender SalariedFlag  VacationHours  \
0                  0             S      M            Y             99   
1                  1             S      F            Y              1   
2                  2             M      M            Y              2   
3                  3             S      M            N             48   
4                  3             M      F            Y              5   

   SickLeaveHours  
0              69  
1              20  
2              21  
3              80  
4              22  


OK that looks good. What about column types?

In [32]:
print(df.dtypes)

OrganizationLevel     int64
MaritalStatus        object
Gender               object
SalariedFlag         object
VacationHours         int64
SickLeaveHours        int64
dtype: object


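Looks pretty good, but let's say we had to change an int column to a factor column (which might happen if the factor column is coded 0, 1, 2, etc.), or change a column to a float. Casting with astype handles both. Here Gender is already an object column, so the first cast is a no-op shown only for illustration, while VacationHours goes from int64 to float64. This is how: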

In [33]:
df['Gender'] = df['Gender'].astype(object) # changing to factor
df['VacationHours'] = df['VacationHours'].astype(float) # to float
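
The case described above, an integer-coded factor, is easier to see on a column that actually starts out as int. A small sketch, assuming a hypothetical frame `demo` with a made-up `Region` column coded 0, 1, 2:

```python
import pandas as pd

# Hypothetical frame: 'Region' is a category stored as integer codes.
demo = pd.DataFrame({"Region": [0, 1, 2, 1], "Hours": [10, 20, 30, 40]})

demo["Region"] = demo["Region"].astype(object)  # treat codes as labels, not numbers
demo["Hours"] = demo["Hours"].astype(float)     # int64 -> float64

print(demo.dtypes)
```

After the cast, `Region` is no longer a numeric column, so downstream code that treats object columns as factors should stop interpreting 2 as "twice" 1.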

Do preprocessing, split the data into train and test sets, and store the result in an object.

In [34]:
random.seed(43) # <-- used to make results reproducible
o = DevelopSupervisedModel(modeltype='classification',
                           df=df,
                           predictedcol='SalariedFlag',
                           graincol='',  # OPTIONAL/ENCOURAGED
                           impute=True,
                           debug=False)
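
What DevelopSupervisedModel does internally is not shown here. As a rough illustration of what "impute, encode, and split" can involve, here is a pandas-only sketch on made-up data. Every name in it (`data`, `features`, `target`, the one-third holdout) is an assumption for illustration and not the hcpytools implementation:

```python
import pandas as pd

# Hypothetical data with one missing value in each feature column.
data = pd.DataFrame({
    "Gender": ["M", "F", None, "M", "F", "M"],
    "VacationHours": [99.0, 1.0, 2.0, None, 5.0, 48.0],
    "SalariedFlag": ["Y", "Y", "Y", "N", "Y", "N"],
})

# Impute: column mean for numbers, most frequent value for text.
data["VacationHours"] = data["VacationHours"].fillna(data["VacationHours"].mean())
data["Gender"] = data["Gender"].fillna(data["Gender"].mode()[0])

# One-hot encode the factor column, dropping the first level.
features = pd.get_dummies(data.drop(columns="SalariedFlag"), drop_first=True)
target = (data["SalariedFlag"] == "Y").astype(int)

# Hold out about a third of the rows; random_state fixes the split.
test_idx = features.sample(frac=1 / 3, random_state=43).index
train_idx = features.index.difference(test_idx)
X_train, X_test = features.loc[train_idx], features.loc[test_idx]
y_train, y_test = target.loc[train_idx], target.loc[test_idx]
```

Passing an explicit seed to the split keeps it repeatable. Note also that scikit-learn estimators created with random_state=None draw from NumPy's global random state, which Python's random module does not seed, so seeding NumPy may be needed as well for fully reproducible fits.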

Now that we've arranged the data and done imputation, let's create a logistic model and see how accurate it is. The method is named linear, but because we set modeltype='classification' above, it fits a logistic regression.

In [35]:
o.linear(cores=1,
         debug=False)


 LogisticRegressionCV(Cs=10, class_weight=None, cv=5, dual=False,
           fit_intercept=True, intercept_scaling=1.0, max_iter=100,
           multi_class='ovr', n_jobs=1, penalty='l2', random_state=None,
           refit=True, scoring=None, solver='lbfgs', tol=0.0001, verbose=0)
Best hyper-parameters found after tuning:
No hyper-parameter tuning was done.

AUC Score: 0.858630952381 



An AUC above 0.8 is generally considered fairly predictive, so at about 0.86 the linear model did reasonably well. (The cell output above also lists the model's settings.)
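
For intuition about the score, AUC can be read as the chance that a randomly chosen positive row gets a higher predicted score than a randomly chosen negative row. The pure-Python sketch below computes it that way on made-up labels and scores. The function name `auc_score` and the data are hypothetical, and this is not how the package computes its number:

```python
def auc_score(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))


# Hypothetical true labels and predicted probabilities.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2]
print(auc_score(labels, scores))
```

A model that ranks every positive above every negative scores 1.0, and one that gives every row the same score lands at 0.5, which is why 0.5 is the usual "no better than chance" baseline.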

While we've already done well, let's see how well a random forest does:

In [36]:
o.randomforest(cores=1,
               debug=False)


 RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=200, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
Best hyper-parameters found after tuning:
No hyper-parameter tuning was done.

AUC Score: 0.902529761905 

Variable importance:
1. OrganizationLevel (0.488065)
2. VacationHours (0.239089)
3. SickLeaveHours (0.210164)
4. Gender.M (0.032773)
5. MaritalStatus.S (0.029909)


Random forest does even better, with an AUC of 0.90, so we'll use it for nightly predictions. Random forest also gives some guidance as to which variables matter most. Features that contribute less than 0.1 in the variable importance list, here Gender.M and MaritalStatus.S, are candidates to leave out of the deploy step (see the next example), though it is worth confirming that the AUC holds up without them.
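
The 0.1 cutoff can be applied mechanically. A short sketch using the importances printed in the random forest output above (the dictionary is typed in by hand here, and the names `importances`, `THRESHOLD`, `keep`, and `drop` are illustrative only):

```python
# Importances copied from the random forest output above.
importances = {
    "OrganizationLevel": 0.488065,
    "VacationHours": 0.239089,
    "SickLeaveHours": 0.210164,
    "Gender.M": 0.032773,
    "MaritalStatus.S": 0.029909,
}

THRESHOLD = 0.1  # cutoff suggested in the text

# Split the features into those at or above the cutoff and those below it.
keep = [name for name, imp in importances.items() if imp >= THRESHOLD]
drop = [name for name, imp in importances.items() if imp < THRESHOLD]

print("keep:", keep)
print("drop:", drop)
```

Gender.M and MaritalStatus.S are indicator columns derived from Gender and MaritalStatus, so dropping them would in practice mean leaving those two source columns out of the query.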

Reach out to Levi Thatcher (levi.thatcher@healthcatalyst.com) if you have any questions!