<a href="https://colab.research.google.com/github/Bharat745/H2O/blob/master/GBM_Logistic_CKD.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Logistic_GBM_CKD

## Installation and Imports

In [0]:
# Installing java version for running H2O on colab
! apt-get install default-jre
!java -version

In [3]:
# Start and connect to local H2O cluster
! pip install h2o
import h2o
h2o.init(nthreads = -1)

Collecting h2o
  Downloading https://files.pythonhosted.org/packages/2a/05/cad6d1d8a4b0e85975aae28e61791510b7b963e7667143f33a63abbd5665/h2o-3.24.0.5.tar.gz (122.4MB)
     |████████████████████████████████| 122.4MB 1.4MB/s
Collecting colorama>=0.3.8 (from h2o)
  Downloading https://files.pythonhosted.org/packages/4f/a6/728666f39bfff1719fc94c481890b2106837da9318031f71a8424b662e12/colorama-0.4.1-py2.py3-none-any.whl
Building wheels for collected packages: h2o
  Building wheel for h2o (setup.py) ... done
  Stored in directory: /root/.cache/pip/wheels/fe/31/5a/d0a96f4ab19a49d5381707eb3017b1a41ec89acbeff54a532c
Successfully built h2o
Installing collected packages: colorama, h2o
Successfully installed colorama-0.4.1 h2o-3.24.0.5
Checking whether there is an H2O instance running at http://localhost:54321 ..... not found.
Attempting to start a local H2O server...
  Java Version: openjdk version "11.0.3" 2019-04-16; OpenJDK Runtime Environment (build 11.0.3+7-Ubuntu-1ubuntu

H2O cluster uptime:,02 secs
H2O cluster timezone:,Etc/UTC
H2O data parsing timezone:,UTC
H2O cluster version:,3.24.0.5
H2O cluster version age:,16 days
H2O cluster name:,H2O_from_python_unknownUser_6ohvwk
H2O cluster total nodes:,1
H2O cluster free memory:,2.938 Gb
H2O cluster total cores:,2
H2O cluster allowed cores:,2


## Problem Statement

The task is to identify patients with Chronic Kidney Disease (CKD). We want to build a model that predicts whether a patient has CKD given their health details. Because the outcome is binary, this is the same setting that logistic regression addresses; here we build the binary classifier with the Gradient Boosting Machine (GBM) in H2O.

## Dataset

The data is a case-study dataset with 8,819 rows (patients) and 34 columns, including an ID column. The last column is CKD, which is the one we have to predict. For the first 6,000 rows the value of CKD is available; for the remaining 2,819 it is missing and has to be predicted. We split the dataset into training and testing sets.

In [10]:
# Importing file in h2o
# you can find the file here 'https://github.com/Bharat745/Logistic-Regression/blob/master/casestudydata.csv'
data = h2o.import_file(path = "https://raw.githubusercontent.com/Bharat745/Logistic-Regression/master/casestudydata.csv")



Parse progress: |█████████████████████████████████████████████████████████| 100%


## Analysis/Modelling

In [11]:
data.shape

(8819, 34)

In [12]:
data.head(5)

ID,Age,Female,Racegrp,Educ,Unmarried,Income,CareSource,Insured,Weight,Height,BMI,Obese,Waist,SBP,DBP,HDL,LDL,Total Chol,Dyslipidemia,PVD,Activity,PoorVision,Smoker,Hypertension,Fam Hypertension,Diabetes,Fam Diabetes,Stroke,CVD,Fam CVD,CHF,Anemia,CKD
1,65,1,white,0,0.0,1.0,other,1,56.0,162.1,21.31,0,83.6,135,71,48,249,297,0,0,3,0,1,0,0,0,1,0,1,0,0,0,0
2,36,1,hispa,0,,1.0,noplace,0,60.2,162.2,22.88,0,76.6,96,52,31,135,166,0,0,3,0,0,0,0,0,0,0,0,0,0,0,0
3,66,1,white,0,1.0,0.0,noplace,1,83.9,162.5,31.77,1,113.2,115,57,44,211,255,1,0,1,0,1,0,0,1,0,0,0,0,0,0,0
4,54,1,white,1,0.0,0.0,DrHMO,1,69.4,160.5,26.94,0,77.9,110,57,74,156,230,0,0,2,0,1,0,0,0,0,0,0,0,0,0,0
5,63,1,black,0,0.0,,clinic,1,73.1,159.2,28.84,0,89.3,132,73,67,154,221,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0




In [13]:
data.types

{'Activity': 'int',
 'Age': 'int',
 'Anemia': 'int',
 'BMI': 'real',
 'CHF': 'int',
 'CKD': 'int',
 'CVD': 'int',
 'CareSource': 'enum',
 'DBP': 'int',
 'Diabetes': 'int',
 'Dyslipidemia': 'int',
 'Educ': 'int',
 'Fam CVD': 'int',
 'Fam Diabetes': 'int',
 'Fam Hypertension': 'int',
 'Female': 'int',
 'HDL': 'int',
 'Height': 'real',
 'Hypertension': 'int',
 'ID': 'int',
 'Income': 'int',
 'Insured': 'int',
 'LDL': 'int',
 'Obese': 'int',
 'PVD': 'int',
 'PoorVision': 'int',
 'Racegrp': 'enum',
 'SBP': 'int',
 'Smoker': 'int',
 'Stroke': 'int',
 'Total Chol': 'int',
 'Unmarried': 'int',
 'Waist': 'real',
 'Weight': 'real'}

In [0]:
# Converting the CKD column to a factor (categorical) because it is either 1 or 0, i.e. the person has CKD or not
data["CKD"] = data["CKD"].asfactor()

In [0]:
# Dropping the ID column as it is not useful
data = data.drop([0], axis=1)

In [16]:
data.head(3)

Age,Female,Racegrp,Educ,Unmarried,Income,CareSource,Insured,Weight,Height,BMI,Obese,Waist,SBP,DBP,HDL,LDL,Total Chol,Dyslipidemia,PVD,Activity,PoorVision,Smoker,Hypertension,Fam Hypertension,Diabetes,Fam Diabetes,Stroke,CVD,Fam CVD,CHF,Anemia,CKD
65,1,white,0,0.0,1,other,1,56.0,162.1,21.31,0,83.6,135,71,48,249,297,0,0,3,0,1,0,0,0,1,0,1,0,0,0,0
36,1,hispa,0,,1,noplace,0,60.2,162.2,22.88,0,76.6,96,52,31,135,166,0,0,3,0,0,0,0,0,0,0,0,0,0,0,0
66,1,white,0,1.0,0,noplace,1,83.9,162.5,31.77,1,113.2,115,57,44,211,255,1,0,1,0,1,0,0,1,0,0,0,0,0,0,0




In [17]:
# Descriptive analysis to understand the data
data.summary()

Unnamed: 0,Age,Female,Racegrp,Educ,Unmarried,Income,CareSource,Insured,Weight,Height,BMI,Obese,Waist,SBP,DBP,HDL,LDL,Total Chol,Dyslipidemia,PVD,Activity,PoorVision,Smoker,Hypertension,Fam Hypertension,Diabetes,Fam Diabetes,Stroke,CVD,Fam CVD,CHF,Anemia,CKD
type,int,int,enum,int,int,int,enum,int,real,real,real,int,real,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,enum
mins,20.0,0.0,,0.0,0.0,0.0,,0.0,25.6,130.4,12.04,0.0,58.5,72.0,10.0,8.0,27.0,72.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,
mean,49.358203877990704,0.5272706656083456,,0.43141266052960564,0.3685908927931158,0.4172220044427022,,0.8045026418561911,79.09434202898548,167.0268544274455,28.294347520225106,0.31574627740649547,96.83997648442094,125.81083303959585,71.50894655764903,51.82833446943877,152.571298716055,204.40338520958764,0.10545413312166912,0.039233473182900554,2.028153025315018,0.06386330586524479,0.3041161129379748,0.40141892665064655,0.23324639981857354,0.11137575138936145,0.31171334618437463,0.03144868301544051,0.06639381537062301,0.3432142857142857,0.02891950358647387,0.02042437308521502,
maxs,85.0,1.0,,1.0,1.0,1.0,,1.0,193.3,200.1,66.44,1.0,173.4,233.0,132.0,160.0,684.0,727.0,1.0,1.0,4.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,
sigma,18.82872580982025,0.49928406499674366,,0.495301582594917,0.4824515160005016,0.4931324154532263,,0.3966058606049539,19.4107068958128,10.099769646593007,6.186076206777893,0.4648396497395821,15.099678091159547,21.034785808424957,12.663170963647858,15.786402387105595,42.994626395282964,42.79774750436619,0.30715510213878133,0.19416097047018913,0.8150365143496283,0.24452408836640444,0.4600581512225135,0.49021349477899856,0.42292174018200457,0.31461471629098425,0.4632196743420641,0.17453687841335183,0.24898338194043482,0.47481057117100844,0.16758986755950545,0.14145489924611554,
zeros,0,4169,,5003,5283,4460,,1702,0,0,0,5836,0,0,0,0,0,0,7889,8473,0,7725,6137,5231,6762,7835,6070,8531,8212,5517,8529,8633,
missing,0,0,0,20,452,1166,3,113,194,191,290,290,314,308,380,17,18,16,0,0,10,567,0,80,0,2,0,11,23,419,36,6,2819
0,65.0,1.0,white,0.0,0.0,1.0,other,1.0,56.0,162.1,21.31,0.0,83.6,135.0,71.0,48.0,249.0,297.0,0.0,0.0,3.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0
1,36.0,1.0,hispa,0.0,,1.0,noplace,0.0,60.2,162.2,22.88,0.0,76.6,96.0,52.0,31.0,135.0,166.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
2,66.0,1.0,white,0.0,1.0,0.0,noplace,1.0,83.9,162.5,31.77,1.0,113.2,115.0,57.0,44.0,211.0,255.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0


### GBM Model

In [0]:
# Importing GBM Estimator
from h2o.estimators.gbm import H2OGradientBoostingEstimator

In [0]:
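# Splitting data into train and test
# split_frame assigns rows at random (about 68% / 32%); it does not separate the
# first 6000 labeled rows from the unlabeled ones. Rows with a missing CKD value
# are skipped when training and scoring (4076 + 1924 = 6000 labeled rows are used).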
train, test = data.split_frame(
    ratios = [0.68],
    destination_frames = ["data_train" , "data_test"],
    seed =123
    )

In [20]:
print("%d/%d" % (train.nrows , test.nrows))

5999/2820


In [0]:
# Define the target and the features (every column except the target)
y = "CKD"
x = [col for col in train.columns if col != y]

In [22]:
# Train the Gradient Boosting Model
m1 = H2OGradientBoostingEstimator(seed= 1234)
m1.train(x, y ,training_frame = train)
print(m1)

gbm Model Build progress: |███████████████████████████████████████████████| 100%
Model Details
H2OGradientBoostingEstimator :  Gradient Boosting Machine
Model Key:  GBM_model_python_1562353934678_1


ModelMetricsBinomial: gbm
** Reported on train data. **

MSE: 0.027576011305050212
RMSE: 0.16606026407617872
LogLoss: 0.10554276990271674
Mean Per-Class Error: 0.07593017838125671
AUC: 0.9783059604632588
pr_auc: 0.8872947907524608
Gini: 0.9566119209265176
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.37204500891332465: 


,0.0,1.0,Error,Rate
0,3737.0,19.0,0.0051,(19.0/3756.0)
1,83.0,237.0,0.2594,(83.0/320.0)
Total,3820.0,256.0,0.025,(102.0/4076.0)


Maximum Metrics: Maximum metrics at their respective thresholds



metric,threshold,value,idx
max f1,0.3720450,0.8229167,133.0
max f2,0.2710757,0.8132344,167.0
max f0point5,0.4114042,0.8922956,124.0
max accuracy,0.3779190,0.9749755,132.0
max precision,0.9327753,1.0,0.0
max recall,0.0077871,1.0,384.0
max specificity,0.9327753,1.0,0.0
max absolute_mcc,0.3720450,0.8154701,133.0
max min_per_class_accuracy,0.1437627,0.9203940,234.0


Gains/Lift Table: Avg response rate:  7.85 %, avg score:  7.88 %



,group,cumulative_data_fraction,lower_threshold,lift,cumulative_lift,response_rate,score,cumulative_response_rate,cumulative_score,capture_rate,cumulative_capture_rate,gain,cumulative_gain
,1,0.0147203,0.7357147,12.7375000,12.7375000,1.0,0.8114940,1.0,0.8114940,0.1875,0.1875,1173.75,1173.75
,2,0.0294406,0.6410291,12.7375000,12.7375000,1.0,0.6872909,1.0,0.7493925,0.1875,0.375,1173.75,1173.75
,3,0.0441609,0.5306833,12.1006250,12.5252083,0.95,0.5939314,0.9833333,0.6975721,0.178125,0.553125,1110.0625,1152.5208333
,4,0.0588813,0.4019564,10.6145833,12.0475521,0.8333333,0.4783013,0.9458333,0.6427544,0.15625,0.709375,961.4583333,1104.7552083
,5,0.0736016,0.3121599,5.5195833,10.7419583,0.4333333,0.3533304,0.8433333,0.5848696,0.08125,0.790625,451.9583333,974.1958333
,6,0.1472031,0.1393123,1.8257083,6.2838333,0.1433333,0.2047313,0.4933333,0.3948004,0.134375,0.925,82.5708333,528.3833333
,7,0.2208047,0.0766102,0.3821250,4.3165972,0.03,0.1069139,0.3388889,0.2988382,0.028125,0.953125,-61.7875000,331.6597222
,8,0.2944063,0.0438436,0.3396667,3.3223646,0.0266667,0.0576452,0.2608333,0.2385400,0.025,0.978125,-66.0333333,232.2364583
,9,0.4416094,0.0177253,0.127375,2.2573681,0.01,0.0284583,0.1772222,0.1685127,0.01875,0.996875,-87.2625,125.7368056



Scoring History: 


,timestamp,duration,number_of_trees,training_rmse,training_logloss,training_auc,training_pr_auc,training_lift,training_classification_error
,2019-07-05 19:41:21,0.048 sec,0.0,0.2689699,0.2751110,0.5,0.0,1.0,0.9214917
,2019-07-05 19:41:21,0.378 sec,1.0,0.2602753,0.2494960,0.8910531,0.4574103,9.1285417,0.0996075
,2019-07-05 19:41:21,0.513 sec,2.0,0.2540177,0.2344933,0.9058382,0.4785623,8.7342857,0.0873405
,2019-07-05 19:41:21,0.589 sec,3.0,0.2484476,0.2228639,0.9102786,0.4992984,9.6053279,0.0863592
,2019-07-05 19:41:21,0.658 sec,4.0,0.2433270,0.2128812,0.9180661,0.5556439,10.826875,0.0907753
---,---,---,---,---,---,---,---,---,---
,2019-07-05 19:41:25,3.768 sec,46.0,0.1698689,0.1096048,0.9757546,0.8760930,12.7375000,0.0279686
,2019-07-05 19:41:25,3.831 sec,47.0,0.1690108,0.1086403,0.9764531,0.8790096,12.7375000,0.0267419
,2019-07-05 19:41:25,3.892 sec,48.0,0.1680927,0.1077674,0.9766540,0.8805277,12.7375000,0.0260059



See the whole table with table.as_data_frame()
Variable Importances: 


variable,relative_importance,scaled_importance,percentage
Age,195.8016510,1.0,0.3791444
Height,37.1978569,0.1899772,0.0720288
SBP,34.3306313,0.1753337,0.0664768
HDL,27.5026779,0.1404619,0.0532553
DBP,26.5593071,0.1356439,0.0514286
---,---,---,---
Income,1.4221861,0.0072634,0.0027539
Insured,0.4855364,0.0024797,0.0009402
Dyslipidemia,0.3235362,0.0016524,0.0006265



See the whole table with table.as_data_frame()



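From the model above, the training AUC is about 0.98. This figure is computed on the same data the model was fit to, so it is optimistic; only an evaluation on held-out data can show whether the model is overfitting. Age is by far the most important variable, accounting for about 38% of the total importance. At the max-F1 threshold the confusion matrix shows that 102 of the 4,076 labeled training rows (2.5%) were misclassified, including 83 of the 320 CKD cases.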
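
The threshold-dependent numbers in the training output can be recomputed by hand from the confusion matrix. The following is a minimal sketch in plain Python, using the counts and the AUC reported above; the helper name `binary_metrics` is hypothetical and not part of H2O.

```python
# Counts from the training confusion matrix at the max-F1 threshold
# (rows = actual class, columns = predicted class).
tn, fp = 3737, 19   # actual 0: predicted 0 / predicted 1
fn, tp = 83, 237    # actual 1: predicted 0 / predicted 1


def binary_metrics(tn, fp, fn, tp):
    """Return common binary-classification metrics as a dict."""
    total = tn + fp + fn + tp
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "error_rate": (fp + fn) / total,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }


train_metrics = binary_metrics(tn, fp, fn, tp)

# Gini is a linear rescaling of AUC: Gini = 2 * AUC - 1.
train_auc = 0.9783059604632588
train_gini = 2 * train_auc - 1

for name, value in train_metrics.items():
    print(f"{name}: {value:.4f}")
print(f"gini: {train_gini:.4f}")
```

The recomputed F1 and Gini agree with the `max f1` and `Gini` values in the output, and the recall shows that roughly a quarter of the CKD cases are missed even on the training data.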

In [23]:
# Using the Gradient Boosting model to make predictions
p = m1.predict(test)

gbm prediction progress: |████████████████████████████████████████████████| 100%


In [24]:
# We get a binary prediction. In p, `predict` is the predicted class; p0 and p1 are the predicted probabilities of no CKD and CKD
p

predict,p0,p1
0,0.974083,0.0259165
0,0.95072,0.0492802
0,0.640439,0.359561
0,0.98119,0.0188098
0,0.980248,0.0197521
0,0.994643,0.00535702
0,0.993395,0.00660525
0,0.994011,0.00598851
0,0.952136,0.0478642
0,0.993922,0.0060782




In [25]:
# Checking the model performance on the test data
perf1 = m1.model_performance(test)
perf1


ModelMetricsBinomial: gbm
** Reported on test data. **

MSE: 0.05692478676740877
RMSE: 0.23858915894777946
LogLoss: 0.1918723161944375
Mean Per-Class Error: 0.17478152309612982
AUC: 0.884917290886392
pr_auc: 0.35014674073383006
Gini: 0.7698345817727841
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.15940139835805198: 


,0.0,1.0,Error,Rate
0,1611.0,169.0,0.0949,(169.0/1780.0)
1,55.0,89.0,0.3819,(55.0/144.0)
Total,1666.0,258.0,0.1164,(224.0/1924.0)


Maximum Metrics: Maximum metrics at their respective thresholds



metric,threshold,value,idx
max f1,0.1594014,0.4427861,176.0
max f2,0.0659746,0.5825243,265.0
max f0point5,0.3022979,0.4113924,98.0
max accuracy,0.8192316,0.9261954,1.0
max precision,0.8290710,1.0,0.0
max recall,0.0061033,1.0,385.0
max specificity,0.8290710,1.0,0.0
max absolute_mcc,0.0940312,0.4043354,234.0
max min_per_class_accuracy,0.0688825,0.8168539,262.0


Gains/Lift Table: Avg response rate:  7.48 %, avg score:  6.89 %



,group,cumulative_data_fraction,lower_threshold,lift,cumulative_lift,response_rate,score,cumulative_response_rate,cumulative_score,capture_rate,cumulative_capture_rate,gain,cumulative_gain
,1,0.0083160,0.6702011,5.8454861,5.8454861,0.4375,0.7457955,0.4375,0.7457955,0.0486111,0.0486111,484.5486111,484.5486111
,2,0.0166320,0.5639000,5.8454861,5.8454861,0.4375,0.5983342,0.4375,0.6720649,0.0486111,0.0972222,484.5486111,484.5486111
,3,0.0285863,0.4993679,6.3900966,6.0732323,0.4782609,0.5358347,0.4545455,0.6150959,0.0763889,0.1736111,539.0096618,507.3232323
,4,0.0363825,0.4262993,6.2351852,6.1079365,0.4666667,0.4591699,0.4571429,0.5816832,0.0486111,0.2222222,523.5185185,510.7936508
,5,0.0415800,0.3684910,5.3444444,6.0125,0.4,0.4026526,0.45,0.5593044,0.0277778,0.25,434.4444444,501.25
,6,0.0940748,0.2275673,4.2332233,5.0196440,0.3168317,0.2925044,0.3756906,0.4104270,0.2222222,0.4722222,323.3223322,401.9643953
,7,0.1439709,0.1449389,3.2010995,4.3893903,0.2395833,0.1850562,0.3285199,0.3323202,0.1597222,0.6319444,220.1099537,338.9390293
,8,0.1902287,0.0987149,2.2518727,3.8696114,0.1685393,0.1206629,0.2896175,0.2808516,0.1041667,0.7361111,125.1872659,286.9611415
,9,0.2874220,0.0462811,1.5004456,3.0684649,0.1122995,0.0661253,0.2296564,0.2082407,0.1458333,0.8819444,50.0445633,206.8464939







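The first model is a default GBM fit on the training data and evaluated on the test data. It shows what the model output looks like and what we are ultimately trying to predict: whether a person has kidney disease, given their health details. On the test data the AUC drops from about 0.98 to about 0.88 and pr_auc from 0.89 to 0.35, so the training metrics clearly overstate how well the model generalizes.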
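
The lift column in the gains/lift table compares the CKD rate inside a score group with the overall CKD rate. A minimal sketch of that calculation in plain Python, using the top group of the test-set table above (the `lift` helper is a hypothetical name, not an H2O function):

```python
def lift(group_response_rate, overall_response_rate):
    """Lift = response rate within a group / overall response rate."""
    return group_response_rate / overall_response_rate


# Test set: 144 CKD cases among 1924 labeled rows (about 7.48 %).
overall_rate = 144 / 1924

# Top-scored group of the test gains/lift table: 43.75 % of its rows have CKD.
top_group_lift = lift(0.4375, overall_rate)

print(f"overall CKD rate: {overall_rate:.4f}")
print(f"lift of top group: {top_group_lift:.4f}")
```

A lift of about 5.8 means the patients the model scores highest are almost six times as likely to have CKD as a randomly chosen labeled patient in the test set.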

### Cross Validation

In [0]:
# Importing GBM Estimator
from h2o.estimators.gbm import H2OGradientBoostingEstimator

In [27]:
# 9 fold cross validation
m2 = H2OGradientBoostingEstimator(model_id = "def9folds" , nfolds = 9)
m2.train(x, y, train)

gbm Model Build progress: |███████████████████████████████████████████████| 100%


In [28]:
m2

Model Details
H2OGradientBoostingEstimator :  Gradient Boosting Machine
Model Key:  def9folds


ModelMetricsBinomial: gbm
** Reported on train data. **

MSE: 0.027576011305050212
RMSE: 0.16606026407617872
LogLoss: 0.10554276990271674
Mean Per-Class Error: 0.07593017838125671
AUC: 0.9783059604632588
pr_auc: 0.8872947907524608
Gini: 0.9566119209265176
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.37204500891332465: 


,0.0,1.0,Error,Rate
0,3737.0,19.0,0.0051,(19.0/3756.0)
1,83.0,237.0,0.2594,(83.0/320.0)
Total,3820.0,256.0,0.025,(102.0/4076.0)


Maximum Metrics: Maximum metrics at their respective thresholds



metric,threshold,value,idx
max f1,0.3720450,0.8229167,133.0
max f2,0.2710757,0.8132344,167.0
max f0point5,0.4114042,0.8922956,124.0
max accuracy,0.3779190,0.9749755,132.0
max precision,0.9327753,1.0,0.0
max recall,0.0077871,1.0,384.0
max specificity,0.9327753,1.0,0.0
max absolute_mcc,0.3720450,0.8154701,133.0
max min_per_class_accuracy,0.1437627,0.9203940,234.0


Gains/Lift Table: Avg response rate:  7.85 %, avg score:  7.88 %



,group,cumulative_data_fraction,lower_threshold,lift,cumulative_lift,response_rate,score,cumulative_response_rate,cumulative_score,capture_rate,cumulative_capture_rate,gain,cumulative_gain
,1,0.0147203,0.7357147,12.7375000,12.7375000,1.0,0.8114940,1.0,0.8114940,0.1875,0.1875,1173.75,1173.75
,2,0.0294406,0.6410291,12.7375000,12.7375000,1.0,0.6872909,1.0,0.7493925,0.1875,0.375,1173.75,1173.75
,3,0.0441609,0.5306833,12.1006250,12.5252083,0.95,0.5939314,0.9833333,0.6975721,0.178125,0.553125,1110.0625,1152.5208333
,4,0.0588813,0.4019564,10.6145833,12.0475521,0.8333333,0.4783013,0.9458333,0.6427544,0.15625,0.709375,961.4583333,1104.7552083
,5,0.0736016,0.3121599,5.5195833,10.7419583,0.4333333,0.3533304,0.8433333,0.5848696,0.08125,0.790625,451.9583333,974.1958333
,6,0.1472031,0.1393123,1.8257083,6.2838333,0.1433333,0.2047313,0.4933333,0.3948004,0.134375,0.925,82.5708333,528.3833333
,7,0.2208047,0.0766102,0.3821250,4.3165972,0.03,0.1069139,0.3388889,0.2988382,0.028125,0.953125,-61.7875000,331.6597222
,8,0.2944063,0.0438436,0.3396667,3.3223646,0.0266667,0.0576452,0.2608333,0.2385400,0.025,0.978125,-66.0333333,232.2364583
,9,0.4416094,0.0177253,0.127375,2.2573681,0.01,0.0284583,0.1772222,0.1685127,0.01875,0.996875,-87.2625,125.7368056




ModelMetricsBinomial: gbm
** Reported on cross-validation data. **

MSE: 0.05922501821200946
RMSE: 0.24336190789030535
LogLoss: 0.20154039217679368
Mean Per-Class Error: 0.18972144568690097
AUC: 0.8739733093716721
pr_auc: 0.3674183108799275
Gini: 0.7479466187433441
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.16518300163321958: 


,0.0,1.0,Error,Rate
0,3386.0,370.0,0.0985,(370.0/3756.0)
1,129.0,191.0,0.4031,(129.0/320.0)
Total,3515.0,561.0,0.1224,(499.0/4076.0)


Maximum Metrics: Maximum metrics at their respective thresholds



metric,threshold,value,idx
max f1,0.1651830,0.4335982,215.0
max f2,0.0836700,0.5715592,275.0
max f0point5,0.4303442,0.4096639,90.0
max accuracy,0.5413436,0.9239450,52.0
max precision,0.9379126,1.0,0.0
max recall,0.0037025,1.0,398.0
max specificity,0.9379126,1.0,0.0
max absolute_mcc,0.0836700,0.3920866,275.0
max min_per_class_accuracy,0.0705881,0.8056443,286.0


Gains/Lift Table: Avg response rate:  7.85 %, avg score:  7.30 %



,group,cumulative_data_fraction,lower_threshold,lift,cumulative_lift,response_rate,score,cumulative_response_rate,cumulative_score,capture_rate,cumulative_capture_rate,gain,cumulative_gain
,1,0.0098135,0.6565561,6.6871875,6.6871875,0.525,0.7455347,0.525,0.7455347,0.065625,0.065625,568.71875,568.71875
,2,0.0196271,0.5496042,7.3240625,7.005625,0.575,0.6063185,0.55,0.6759266,0.071875,0.1375,632.4062500,600.5625
,3,0.0289500,0.4713338,5.3631579,6.4766949,0.4210526,0.5141451,0.5084746,0.6238275,0.05,0.1875,436.3157895,547.6694915
,4,0.0392542,0.4280005,5.7622024,6.2891406,0.4523810,0.4479167,0.49375,0.5776509,0.059375,0.246875,476.2202381,528.9140625
,5,0.0505397,0.3916058,3.0459239,5.5649272,0.2391304,0.4087106,0.4368932,0.5399264,0.034375,0.28125,204.5923913,456.4927184
,6,0.0978901,0.2408138,3.9598446,4.7885338,0.3108808,0.3073933,0.3759398,0.4274480,0.1875,0.46875,295.9844560,378.8533835
,7,0.1464671,0.1527994,3.0878788,4.2244975,0.2424242,0.1911717,0.3316583,0.3490850,0.15,0.61875,208.7878788,322.4497487
,8,0.1960255,0.1011202,2.0808787,3.6825563,0.1633663,0.1210925,0.2891114,0.2914448,0.103125,0.721875,108.0878713,268.2556320
,9,0.2963690,0.0459237,1.4014364,2.9102235,0.1100244,0.0703995,0.2284768,0.2166041,0.140625,0.8625,40.1436430,191.0223510



Cross-Validation Metrics Summary: 


,mean,sd,cv_1_valid,cv_2_valid,cv_3_valid,cv_4_valid,cv_5_valid,cv_6_valid,cv_7_valid,cv_8_valid,cv_9_valid
accuracy,0.8944479,0.0294786,0.8721973,0.9241706,0.8097345,0.9273128,0.8543046,0.8867102,0.9568034,0.9090909,0.9097065
auc,0.8778754,0.0121869,0.8919153,0.8751516,0.8780404,0.8987265,0.8894215,0.8851211,0.8860573,0.8448558,0.8515891
err,0.1055521,0.0294786,0.1278027,0.0758294,0.1902655,0.0726872,0.1456954,0.1132898,0.0431965,0.0909091,0.0902934
err_count,47.77778,13.353226,57.0,32.0,86.0,33.0,66.0,52.0,20.0,44.0,40.0
f0point5,0.4486538,0.0530879,0.4334365,0.5072464,0.3023758,0.5025126,0.4216867,0.3900709,0.5729167,0.4932736,0.4143646
f1,0.4811082,0.0324869,0.4955752,0.4666667,0.3943662,0.5479452,0.5147059,0.4583333,0.5238095,0.5,0.4285714
f2,0.5365456,0.0505676,0.5785124,0.4320988,0.5668016,0.6024097,0.6603774,0.5555556,0.4824561,0.5069125,0.4437870
lift_top_group,6.95878,1.5480382,5.9269104,8.865546,5.5346937,8.368664,3.0099669,7.714286,11.023809,6.431894,5.753247
logloss,0.2014503,0.0197531,0.2232611,0.2004143,0.2094303,0.1667606,0.2315405,0.1928828,0.1472274,0.2382921,0.2032436


Scoring History: 


,timestamp,duration,number_of_trees,training_rmse,training_logloss,training_auc,training_pr_auc,training_lift,training_classification_error
,2019-07-05 19:49:29,20.886 sec,0.0,0.2689699,0.2751110,0.5,0.0,1.0,0.9214917
,2019-07-05 19:49:29,20.924 sec,1.0,0.2602753,0.2494960,0.8910531,0.4574103,9.1285417,0.0996075
,2019-07-05 19:49:29,20.961 sec,2.0,0.2540177,0.2344933,0.9058382,0.4785623,8.7342857,0.0873405
,2019-07-05 19:49:29,20.997 sec,3.0,0.2484476,0.2228639,0.9102786,0.4992984,9.6053279,0.0863592
,2019-07-05 19:49:29,21.045 sec,4.0,0.2433270,0.2128812,0.9180661,0.5556439,10.826875,0.0907753
---,---,---,---,---,---,---,---,---,---
,2019-07-05 19:49:31,22.371 sec,46.0,0.1698689,0.1096048,0.9757546,0.8760930,12.7375000,0.0279686
,2019-07-05 19:49:31,22.401 sec,47.0,0.1690108,0.1086403,0.9764531,0.8790096,12.7375000,0.0267419
,2019-07-05 19:49:31,22.429 sec,48.0,0.1680927,0.1077674,0.9766540,0.8805277,12.7375000,0.0260059



See the whole table with table.as_data_frame()
Variable Importances: 


variable,relative_importance,scaled_importance,percentage
Age,195.8016510,1.0,0.3791444
Height,37.1978569,0.1899772,0.0720288
SBP,34.3306313,0.1753337,0.0664768
HDL,27.5026779,0.1404619,0.0532553
DBP,26.5593071,0.1356439,0.0514286
---,---,---,---
Income,1.4221861,0.0072634,0.0027539
Insured,0.4855364,0.0024797,0.0009402
Dyslipidemia,0.3235362,0.0016524,0.0006265



See the whole table with table.as_data_frame()
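
The cross-validation metrics summary above reports one AUC per fold. As a small sketch in plain Python, the mean across the nine folds can be recomputed from the `cv_*_valid` values in that table; the range between the weakest and strongest fold gives a rough sense of how stable the estimate is. The variable names are illustrative.

```python
from statistics import mean

# Per-fold validation AUCs copied from the cross-validation metrics summary.
fold_aucs = [
    0.8919153, 0.8751516, 0.8780404, 0.8987265, 0.8894215,
    0.8851211, 0.8860573, 0.8448558, 0.8515891,
]

cv_mean_auc = mean(fold_aucs)
cv_auc_range = max(fold_aucs) - min(fold_aucs)

print(f"mean fold AUC: {cv_mean_auc:.4f}")
print(f"range across folds: {cv_auc_range:.4f}")
```

The mean fold AUC of about 0.88 is far closer to the test AUC than the training AUC is, which is why the cross-validation block is the more trustworthy part of this output.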




In [29]:
perf2 = m2.model_performance(test)
perf2


ModelMetricsBinomial: gbm
** Reported on test data. **

MSE: 0.05692478676740877
RMSE: 0.23858915894777946
LogLoss: 0.1918723161944375
Mean Per-Class Error: 0.17478152309612982
AUC: 0.884917290886392
pr_auc: 0.35014674073383006
Gini: 0.7698345817727841
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.15940139835805198: 


,0.0,1.0,Error,Rate
0,1611.0,169.0,0.0949,(169.0/1780.0)
1,55.0,89.0,0.3819,(55.0/144.0)
Total,1666.0,258.0,0.1164,(224.0/1924.0)


Maximum Metrics: Maximum metrics at their respective thresholds



metric,threshold,value,idx
max f1,0.1594014,0.4427861,176.0
max f2,0.0659746,0.5825243,265.0
max f0point5,0.3022979,0.4113924,98.0
max accuracy,0.8192316,0.9261954,1.0
max precision,0.8290710,1.0,0.0
max recall,0.0061033,1.0,385.0
max specificity,0.8290710,1.0,0.0
max absolute_mcc,0.0940312,0.4043354,234.0
max min_per_class_accuracy,0.0688825,0.8168539,262.0


Gains/Lift Table: Avg response rate:  7.48 %, avg score:  6.89 %



,group,cumulative_data_fraction,lower_threshold,lift,cumulative_lift,response_rate,score,cumulative_response_rate,cumulative_score,capture_rate,cumulative_capture_rate,gain,cumulative_gain
,1,0.0083160,0.6702011,5.8454861,5.8454861,0.4375,0.7457955,0.4375,0.7457955,0.0486111,0.0486111,484.5486111,484.5486111
,2,0.0166320,0.5639000,5.8454861,5.8454861,0.4375,0.5983342,0.4375,0.6720649,0.0486111,0.0972222,484.5486111,484.5486111
,3,0.0285863,0.4993679,6.3900966,6.0732323,0.4782609,0.5358347,0.4545455,0.6150959,0.0763889,0.1736111,539.0096618,507.3232323
,4,0.0363825,0.4262993,6.2351852,6.1079365,0.4666667,0.4591699,0.4571429,0.5816832,0.0486111,0.2222222,523.5185185,510.7936508
,5,0.0415800,0.3684910,5.3444444,6.0125,0.4,0.4026526,0.45,0.5593044,0.0277778,0.25,434.4444444,501.25
,6,0.0940748,0.2275673,4.2332233,5.0196440,0.3168317,0.2925044,0.3756906,0.4104270,0.2222222,0.4722222,323.3223322,401.9643953
,7,0.1439709,0.1449389,3.2010995,4.3893903,0.2395833,0.1850562,0.3285199,0.3323202,0.1597222,0.6319444,220.1099537,338.9390293
,8,0.1902287,0.0987149,2.2518727,3.8696114,0.1685393,0.1206629,0.2896175,0.2808516,0.1041667,0.7361111,125.1872659,286.9611415
,9,0.2874220,0.0462811,1.5004456,3.0684649,0.1122995,0.0661253,0.2296564,0.2082407,0.1458333,0.8819444,50.0445633,206.8464939







### Over Fitting


In [30]:
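# Deliberately overfit: deeper trees and many more trees than the defaults.
# A distinct model_id keeps this model from replacing m2 ("def9folds") in the cluster.
m3 = H2OGradientBoostingEstimator(model_id = "overfit6folds" ,
                                  max_depth = 10,
                                  ntrees = 500,
                                  nfolds = 6)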
m3.train(x, y, train)

gbm Model Build progress: |███████████████████████████████████████████████| 100%


In [31]:
perf3 = m3.model_performance(test)
perf3


ModelMetricsBinomial: gbm
** Reported on test data. **

MSE: 0.06933039847886922
RMSE: 0.26330666242780343
LogLoss: 0.7353627009931729
Mean Per-Class Error: 0.20460362047440706
AUC: 0.8494908707865169
pr_auc: 0.3503867096191803
Gini: 0.6989817415730337
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.0002517106903091874: 


,0.0,1.0,Error,Rate
0,1662.0,118.0,0.0663,(118.0/1780.0)
1,71.0,73.0,0.4931,(71.0/144.0)
Total,1733.0,191.0,0.0982,(189.0/1924.0)


Maximum Metrics: Maximum metrics at their respective thresholds



metric,threshold,value,idx
max f1,0.0002517,0.4358209,190.0
max f2,0.0000012,0.5458290,355.0
max f0point5,0.2514877,0.4207921,64.0
max accuracy,0.9965195,0.9272349,15.0
max precision,0.9999980,1.0,0.0
max recall,0.0000000,1.0,399.0
max specificity,0.9999980,1.0,0.0
max absolute_mcc,0.0002517,0.3877674,190.0
max min_per_class_accuracy,0.0000001,0.7803371,385.0


Gains/Lift Table: Avg response rate:  7.48 %, avg score:  2.82 %



,group,cumulative_data_fraction,lower_threshold,lift,cumulative_lift,response_rate,score,cumulative_response_rate,cumulative_score,capture_rate,cumulative_capture_rate,gain,cumulative_gain
,1,0.0093555,0.9962791,7.4228395,7.4228395,0.5555556,0.9988760,0.5555556,0.9988760,0.0694444,0.0694444,642.2839506,642.2839506
,2,0.0197505,0.7849909,6.0125,6.6805556,0.45,0.9350055,0.5,0.9652599,0.0625,0.1319444,501.25,568.0555556
,3,0.0306653,0.3670867,6.9986772,6.7937853,0.5238095,0.5604197,0.5084746,0.8211643,0.0763889,0.2083333,599.8677249,579.3785311
,4,0.0405405,0.1026891,5.6257310,6.5092593,0.4210526,0.2161018,0.4871795,0.6737772,0.0555556,0.2638889,462.5730994,550.9259259
,5,0.0509356,0.0340222,3.3402778,5.8625283,0.25,0.0551944,0.4387755,0.5475358,0.0347222,0.2986111,234.0277778,486.2528345
,6,0.0925156,0.0004508,3.8413194,4.9541199,0.2875,0.0068273,0.3707865,0.3045208,0.1597222,0.4583333,284.1319444,395.4119850
,7,0.1439709,0.0000158,2.4292929,4.0517449,0.1818182,0.0001135,0.3032491,0.1957254,0.125,0.5833333,142.9292929,305.1744886
,8,0.1943867,0.0000020,2.4793814,3.6439394,0.1855670,0.0000064,0.2727273,0.1449640,0.125,0.7083333,147.9381443,264.3939394
,9,0.2936590,0.0000001,1.2591623,2.8377581,0.0942408,0.0000005,0.2123894,0.0959586,0.125,0.8333333,25.9162304,183.7758112







In [0]:
# Analysing model performance on training data and test data
# m1, m2, m3 are the trained models and carry their training-data metrics
# perf1, perf2, perf3 are the performance of those models on test data

# Training model performance

In [33]:
m1.confusion_matrix()

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.37204500891332465: 


,0.0,1.0,Error,Rate
0,3737.0,19.0,0.0051,(19.0/3756.0)
1,83.0,237.0,0.2594,(83.0/320.0)
Total,3820.0,256.0,0.025,(102.0/4076.0)




In [34]:
m2.confusion_matrix()

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.37204500891332465: 


,0.0,1.0,Error,Rate
0,3737.0,19.0,0.0051,(19.0/3756.0)
1,83.0,237.0,0.2594,(83.0/320.0)
Total,3820.0,256.0,0.025,(102.0/4076.0)




In [35]:
m3.confusion_matrix()

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.9999996058366459: 


,0.0,1.0,Error,Rate
0,3756.0,0.0,0.0,(0.0/3756.0)
1,0.0,320.0,0.0,(0.0/320.0)
Total,3756.0,320.0,0.0,(0.0/4076.0)




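The training confusion matrices of the simple model and the cross-validated model are identical. This is expected: cross-validation only adds held-out estimates, while the final model is still fit on the full training frame. When we deliberately overfit the model to the training data (deeper trees, 500 trees), the training error rate drops to 0, which is a sign that the model has memorized the training rows rather than learned patterns that generalize.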
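
One way to make the overfitting visible is to put training and test error rates side by side. The sketch below is plain Python with the misclassification counts copied from the confusion matrices in this notebook, each taken at the model's own max-F1 threshold; the dictionary layout and labels are illustrative assumptions, and m2 is omitted because its numbers match m1.

```python
# (misclassified, total) pairs taken from the confusion matrices,
# each at the model's own max-F1 threshold.
errors = {
    "m1 (default GBM)": {"train": (102, 4076), "test": (224, 1924)},
    "m3 (deep, 500 trees)": {"train": (0, 4076), "test": (189, 1924)},
}

gaps = {}
for model, counts in errors.items():
    train_err = counts["train"][0] / counts["train"][1]
    test_err = counts["test"][0] / counts["test"][1]
    # Generalization gap: how much worse the model does on unseen data.
    gaps[model] = test_err - train_err
    print(f"{model}: train={train_err:.4f} test={test_err:.4f} gap={gaps[model]:.4f}")
```

Because the thresholds differ between models, these error rates are only a rough comparison; a training error of exactly zero next to a test error near 10% is the part that points to memorization.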

# Testing Model Performance

In [36]:
perf1.confusion_matrix()

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.15940139835805198: 


,0.0,1.0,Error,Rate
0,1611.0,169.0,0.0949,(169.0/1780.0)
1,55.0,89.0,0.3819,(55.0/144.0)
Total,1666.0,258.0,0.1164,(224.0/1924.0)




In [37]:
perf2.confusion_matrix()

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.15940139835805198: 


,0.0,1.0,Error,Rate
0,1611.0,169.0,0.0949,(169.0/1780.0)
1,55.0,89.0,0.3819,(55.0/144.0)
Total,1666.0,258.0,0.1164,(224.0/1924.0)




In [38]:
perf3.confusion_matrix()

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.0002517106903091874: 


,0.0,1.0,Error,Rate
0,1662.0,118.0,0.0663,(118.0/1780.0)
1,71.0,73.0,0.4931,(71.0/144.0)
Total,1733.0,191.0,0.0982,(189.0/1924.0)




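When we apply the models to the test data, the default GBM and the cross-validated model again give identical results. The overfit model misclassifies slightly fewer rows at its max-F1 threshold (189 vs. 224 of 1,924), but it misses more CKD cases (71 vs. 55 of 144), and its threshold-independent metrics are worse: the test AUC falls from 0.885 to 0.849 and the log loss rises from 0.19 to 0.74. Overall the overfit model generalizes worse, which is as expected.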
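
To compare the three models on the test set without depending on a classification threshold, one can look at AUC and log loss. A minimal sketch in plain Python, with the values copied from the `perf1`, `perf2`, and `perf3` outputs above (the dictionary layout is an illustrative assumption):

```python
# Threshold-independent test metrics copied from the model_performance outputs.
test_metrics = {
    "m1": {"auc": 0.884917290886392, "logloss": 0.1918723161944375},
    "m2": {"auc": 0.884917290886392, "logloss": 0.1918723161944375},
    "m3": {"auc": 0.8494908707865169, "logloss": 0.7353627009931729},
}

# Higher AUC is better; lower log loss is better.
# On ties (m1 and m2 are identical), max/min return the first key.
best_by_auc = max(test_metrics, key=lambda m: test_metrics[m]["auc"])
best_by_logloss = min(test_metrics, key=lambda m: test_metrics[m]["logloss"])

print(f"best test AUC: {best_by_auc}")
print(f"best test log loss: {best_by_logloss}")
```

Both criteria favor the default model over the overfit one; the large log loss of m3 reflects predicted probabilities pushed very close to 0 or 1, which are penalized heavily whenever they are wrong.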