### Automatic Machine Learning

This notebook ingests a dataset and trains many machine learning models, searching each model's hyperparameters for the best-performing values. A leaderboard ranks the models as they are trained. Finally, a stacked ensemble is built from some of the base learners and added to the leaderboard. The best model is used in production.


In [12]:
import h2o
from h2o.automl import H2OAutoML

In [13]:
%%capture
h2o.init(nthreads=1, max_mem_size=2)

In [14]:
# Import some data from Amazon S3
df = h2o.import_file("https://s3-us-west-1.amazonaws.com/dsclouddata/LendingClubData/LoansGoodBad.csv")

# Stratified Split into Train/Test
stratsplit = df["Bad_Loan"].stratified_split(test_frac=0.3, seed=12349453)
train = df[stratsplit=="train"]
test = df[stratsplit=="test"]


Parse progress: |█████████████████████████████████████████████████████████| 100%
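
To illustrate what a stratified split does, here is a minimal pure-Python sketch. It is not H2O's implementation: the function `stratified_split` below and the toy labels are assumptions made for illustration. The idea is that each class of the response is split separately, so the train and test sets keep roughly the same share of bad loans.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_frac=0.3, seed=0):
    """Return one "train"/"test" tag per label, preserving class ratios."""
    rng = random.Random(seed)

    # Group row indices by class label.
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    # Within each class, send a test_frac share of the rows to the test set.
    assignment = ["train"] * len(labels)
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_frac)
        for i in indices[:n_test]:
            assignment[i] = "test"
    return assignment

# Toy response column (hypothetical): 80 good loans (0) and 20 bad loans (1).
labels = [0] * 80 + [1] * 20
tags = stratified_split(labels, test_frac=0.3, seed=42)
test_labels = [y for y, t in zip(labels, tags) if t == "test"]

# 30 test rows, 6 of them bad loans: the same 20% rate as the full data.
print(len(test_labels), sum(test_labels))
```

A plain random split could, by chance, leave the test set with noticeably more or fewer bad loans than the training set, which matters when one class is rare.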


In [15]:
test.head(10)

| RowID | Loan_Amount | Term | Interest_Rate | Employment_Years | Home_Ownership | Annual_Income | Verification_Status | Loan_Purpose | State | Debt_to_Income | Delinquent_2yr | Revolving_Cr_Util | Total_Accounts | Bad_Loan | Longest_Credit_Length |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 5000 | 36 months | 10.65 | 10.0 | RENT | 24000 | VERIFIED - income | credit_card | AZ | 27.65 | 0 | 83.7 | 9 | 0 | 26 |
| 7 | 5600 | 60 months | 21.28 | 4.0 | OWN | 40000 | VERIFIED - income source | small_business | CA | 5.55 | 0 | 32.6 | 13 | 1 | 7 |
| 8 | 5375 | 60 months | 12.69 | 0.5 | RENT | 15000 | VERIFIED - income | other | TX | 18.08 | 0 | 36.5 | 3 | 1 | 7 |
| 10 | 12000 | 36 months | 12.69 | 10.0 | OWN | 75000 | VERIFIED - income source | debt_consolidation | CA | 10.78 | 0 | 67.1 | 34 | 0 | 22 |
| 11 | 9000 | 36 months | 13.49 | 0.5 | RENT | 30000 | VERIFIED - income source | debt_consolidation | VA | 10.08 | 0 | 91.7 | 9 | 1 | 7 |
| 15 | 10000 | 36 months | 15.27 | 4.0 | RENT | 42000 | not verified | home_improvement | CA | 18.6 | 0 | 70.2 | 28 | 0 | 13 |
| 16 | 3600 | 36 months | 6.03 | 10.0 | MORTGAGE | 110000 | not verified | major_purchase | CT | 10.52 | 0 | 16.0 | 42 | 0 | 18 |
| 17 | 6000 | 36 months | 11.71 | 1.0 | MORTGAGE | 84000 | VERIFIED - income | medical | UT | 18.44 | 2 | 37.73 | 14 | 0 | 8 |
| 20 | 10000 | 36 months | 11.71 | 10.0 | OWN | 50000 | VERIFIED - income source | credit_card | TX | 11.18 | 0 | 82.4 | 21 | 0 | 26 |
| 29 | 31825 | 36 months | 7.9 | 5.0 | MORTGAGE | 75000 | VERIFIED - income | debt_consolidation | NJ | 14.03 | 0 | 27.4 | 26 | 0 | 30 |




In [16]:
# Identify predictors and response
x = train.columns
y = "Bad_Loan"
x.remove(y)

# For binary classification, response should be a factor
train[y] = train[y].asfactor()
test[y] = test[y].asfactor()

In [19]:
# Run AutoML for up to 20 minutes (1200 s), or stop early once the stopping
# metric fails to improve by more than 0.001 (relative) over 5 scoring rounds
autoModel = H2OAutoML(max_runtime_secs=1200, stopping_rounds=5, stopping_tolerance=0.001)
autoModel.train(x=x, y=y,
                training_frame=train,
                leaderboard_frame=test)

AutoML progress: |████████████████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%
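
The interaction of `stopping_rounds` and `stopping_tolerance` can be sketched in plain Python. This is a deliberately simplified rule, not H2O's exact criterion (H2O compares moving averages of the stopping metric); the helper `should_stop` and the AUC histories are hypothetical and only meant to show the idea.

```python
def should_stop(history, stopping_rounds=5, stopping_tolerance=0.001):
    """Simplified early-stopping check for a metric where higher is better (e.g. AUC).

    Returns True when none of the last `stopping_rounds` scores beats the best
    earlier score by more than a relative `stopping_tolerance`.
    """
    if len(history) <= stopping_rounds:
        return False  # not enough scoring events yet
    best_before = max(history[:-stopping_rounds])
    best_recent = max(history[-stopping_rounds:])
    return best_recent <= best_before * (1 + stopping_tolerance)

# Hypothetical AUC values, one per scoring round.
improving = [0.60, 0.64, 0.67, 0.69, 0.70, 0.71, 0.72]
plateaued = [0.60, 0.70, 0.7001, 0.7002, 0.7001, 0.7003, 0.7002]

print(should_stop(improving))  # False: the last 5 rounds still improve clearly
print(should_stop(plateaued))  # True: gains stay within the 0.1% tolerance
```

With a rule of this kind, the 20-minute budget acts as an upper bound: the run can finish sooner when further rounds stop paying off.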


## Leaderboard
Display the trained models, sorted by descending AUC on the leaderboard frame (`test`).

In [20]:
leaders = autoModel.leaderboard
leaders

|   | model_id | auc | logloss |
|---|---|---|---|
| 0 | StackedEnsemble_model_1496453118819_1922 | 0.714036 | 0.433196 |
| 1 | GBM_grid__80d02ea5022d38991707ef2352da46d4_model_0 | 0.712403 | 0.4695 |
| 2 | GLM_grid__80d02ea5022d38991707ef2352da46d4_model_1 | 0.702588 | 0.434005 |
| 3 | GLM_grid__80d02ea5022d38991707ef2352da46d4_model_0 | 0.702441 | 0.437575 |
| 4 | XRT_model_1496453118819_1099 | 0.699577 | 0.437954 |
| 5 | DRF_model_1496453118819_925 | 0.69619 | 0.440296 |
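
For readers unfamiliar with the two leaderboard metrics, the sketch below computes AUC and logloss from scratch on a tiny made-up example. The functions `auc` and `logloss` and the sample values are assumptions for illustration; H2O computes these metrics internally and its numerical details may differ.

```python
import math

def auc(y_true, p_pred):
    """Chance that a random positive is scored above a random negative (ties count half)."""
    pos = [p for y, p in zip(y_true, p_pred) if y == 1]
    neg = [p for y, p in zip(y_true, p_pred) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def logloss(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood of the true labels under the predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# Hypothetical labels (1 = bad loan) and predicted probabilities of a bad loan.
y_true = [0, 0, 1, 1]
p_bad = [0.1, 0.4, 0.35, 0.8]

print(auc(y_true, p_bad))  # 0.75: 3 of the 4 positive/negative pairs are ordered correctly
print(logloss(y_true, p_bad))
```

AUC depends only on how the models rank the loans, while logloss also penalizes poorly calibrated probabilities. That is why the second-ranked GBM above can be close to the leader on AUC and still have the worst logloss on the board.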




In [None]:
# Score the test set with the best model on the leaderboard
preds = autoModel.leader.predict(test)