In [1]:
import h2o
h2o.init()

Checking whether there is an H2O instance running at http://localhost:54321 . connected.


H2O cluster uptime:,1 hour 32 mins
H2O cluster timezone:,Africa/Luanda
H2O data parsing timezone:,UTC
H2O cluster version:,3.24.0.2
H2O cluster version age:,15 days
H2O cluster name:,H2O_from_python_maria_7ot81p
H2O cluster total nodes:,1
H2O cluster free memory:,801 Mb
H2O cluster total cores:,4
H2O cluster allowed cores:,4


# Problem Description

Using the available features, we would like to predict whether a person will click on our ad while using the internet.

In [4]:
data = h2o.import_file("advertising.csv")
data.head(5)  # head() displays the first 10 rows by default; here we request 5

Parse progress: |█████████████████████████████████████████████████████████| 100%


Daily Time Spent on Site,Age,Area Income,Daily Internet Usage,Ad Topic Line,City,Male,Country,Timestamp,Clicked on Ad
68.95,35,61833.9,256.09,Cloned 5thgeneration orchestration,Wrightburgh,0,Tunisia,2016-03-27 00:53:11,0
80.23,31,68441.9,193.77,Monitored national standardization,West Jodi,1,Nauru,2016-04-04 01:39:02,0
69.47,26,59785.9,236.5,Organic bottom-line service-desk,Davidton,0,San Marino,2016-03-13 20:35:42,0
74.15,29,54806.2,245.89,Triple-buffered reciprocal time-frame,West Terrifurt,1,Italy,2016-01-10 02:31:19,0
68.37,35,73890.0,225.58,Robust logistical utilization,South Manuel,0,Iceland,2016-06-03 03:36:18,0




# Important features of the H2O GLM
(http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/glm.html)

1) Standardization: Standardization of numeric columns is enabled by default (standardize = True). Without it, the fit can be dominated by variables that have larger variances purely as a matter of scale rather than true contribution. Only advanced users should disable this option.

2) Missing values handling: In H2O, the Deep Learning and GLM algorithms either skip rows with NA values or mean-impute them. The missing_values_handling option defaults to MeanImputation, but Skip can be selected instead.

3) Unwanted columns: The ignored_columns parameter specifies an array of column names that should be ignored.

4) Collinearity: Collinear columns can cause problems during model fitting. The preferred way to deal with collinearity (and the default H2O behavior) is to add regularization. However, if you want a non-regularized solution, you can have collinear columns removed automatically by enabling the "remove_collinear_columns" option. This option can only be used when solver=IRLSM and with no regularization (lambda=0).
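
To make point 1 concrete, here is a minimal sketch of what standardization does to two columns on very different scales. It is plain Python, not H2O code; the helper name `standardize` is ours, and we assume the sample standard deviation is used.

```python
from statistics import mean, stdev

def standardize(values):
    """Rescale a numeric column to mean 0 and standard deviation 1."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation (an assumption of this sketch)
    return [(v - mu) / sigma for v in values]

# Two columns on very different scales (first five rows of the data set)
area_income = [61833.9, 68441.9, 59785.9, 54806.2, 73890.0]
age = [35, 31, 26, 29, 35]

z_income = standardize(area_income)
z_age = standardize(age)

# After rescaling, both columns are expressed in units of standard deviations
print([round(z, 2) for z in z_income])
print([round(z, 2) for z in z_age])
```

After rescaling, a coefficient on Area Income and a coefficient on Age refer to comparable units, which is why the standardized coefficients are the ones to compare across predictors.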

In [5]:
# Check the number of missing values.
# Depending on the selected missing-value handling policy, missing values are
# either replaced by the column mean or the whole row is skipped. The default is mean imputation.
print('missing:', data.isna().sum())

missing: 0.0
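
Our data has no missing values, so the policy makes no difference here. For illustration only, the two policies can be sketched in plain Python as follows; the helper names and the toy values are hypothetical, and this is not H2O's implementation.

```python
from statistics import mean

def mean_impute(column):
    """Replace missing entries (None) with the mean of the observed entries."""
    observed = [v for v in column if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in column]

def skip_rows(rows):
    """Drop every row that contains at least one missing entry."""
    return [row for row in rows if all(v is not None for v in row)]

# Hypothetical column with one missing value
ages = [35, None, 26, 29]
print(mean_impute(ages))

# Hypothetical rows of (daily time on site, age); the second row has a missing age
rows = [(68.95, 35), (80.23, None), (69.47, 26)]
print(skip_rows(rows))
```

Mean imputation keeps every row at the cost of inventing a value, while skipping keeps only fully observed rows at the cost of a smaller training set.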


In [6]:
data.describe()

Rows:1000
Cols:10




Unnamed: 0,Daily Time Spent on Site,Age,Area Income,Daily Internet Usage,Ad Topic Line,City,Male,Country,Timestamp,Clicked on Ad
type,real,int,real,real,string,string,int,enum,time,int
mins,32.6,19.0,13996.5,104.78,,,0.0,,1451616730000.0,0.0
mean,65.0002,36.009,55000.00008,180.00009999999997,,,0.481,,1460284446636.0,0.5
maxs,91.43,61.0,79484.8,269.96,,,1.0,,1469319736000.0,1.0
sigma,15.853614567500209,8.78556231012592,13414.634022282358,43.9023393019801,,,0.49988887654046565,,5089978400.060859,0.5002501876563868
zeros,0,0,0,0,0,0,519,,0,500
missing,0,0,0,0,0,0,0,0,0,0
0,68.95,35.0,61833.9,256.09,Cloned 5thgeneration orchestration,Wrightburgh,0.0,Tunisia,2016-03-27 00:53:11,0.0
1,80.23,31.0,68441.85,193.77,Monitored national standardization,West Jodi,1.0,Nauru,2016-04-04 01:39:02,0.0
2,69.47,26.0,59785.94,236.5,Organic bottom-line service-desk,Davidton,0.0,San Marino,2016-03-13 20:35:42,0.0


In [7]:
i_split = data.split_frame(ratios = [0.8], seed = 1234)
train = i_split[0] # using 80% for training
test = i_split[1] #rest 20% for testing
print(train.shape, test.shape)

(802, 10) (198, 10)
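
Note that the split is 802/198 rather than exactly 800/200: split_frame gives an approximate split, not an exact one. A plain-Python sketch of how a per-row random assignment produces this behaviour is shown below. The function `approximate_split` is hypothetical and is not H2O's actual algorithm, so its counts will not match the ones above.

```python
import random

def approximate_split(n_rows, ratio, seed):
    """Assign each row to train with probability `ratio`, otherwise to test."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for i in range(n_rows):
        if rng.random() < ratio:
            train_idx.append(i)
        else:
            test_idx.append(i)
    return train_idx, test_idx

train_idx, test_idx = approximate_split(1000, 0.8, seed=1234)

# The sizes are close to 800/200 but generally not exactly equal to them
print(len(train_idx), len(test_idx))
```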


In [12]:
# convert the response column to a factor (two levels) so the GLM treats it as a class label
data['Clicked on Ad'] = data['Clicked on Ad'].asfactor()

# set the predictor names and the response column name
predictors= ['Daily Time Spent on Site', 'Age', 'Area Income','Daily Internet Usage', 'Male']
response = 'Clicked on Ad'

# split into train and test sets
i_split = data.split_frame(ratios = [0.8], seed = 1234)
train = i_split[0] # using 80% for training
test = i_split[1] #rest 20% for testing


In [1]:
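from h2o.estimators.glm import H2OGeneralizedLinearEstimator

# p-values are only available for a fit without regularization, so lambda_ is set to 0 explicitly
glm_model = H2OGeneralizedLinearEstimator(family = "binomial", lambda_ = 0, compute_p_values = True)
glm_model.train(x = predictors, y = response, training_frame = train)
print(glm_model)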



In [15]:
# Coefficients that can be applied to the non-standardized data.
print(glm_model.coef())

{'Intercept': 27.87692087078464, 'Daily Time Spent on Site': -0.1966950113270224, 'Age': 0.14699298814289136, 'Area Income': -0.00013020677570598898, 'Daily Internet Usage': -0.06268513367372519, 'Male': -0.5392970117985854}
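
As a sketch of how these coefficients are applied to non-standardized data, we can compute the click probability for the first row of the data set by hand, using the logistic link of the binomial family. The helper `predict_p1` is ours, and we do not know whether this particular row ended up in the training or the test split.

```python
import math

# Coefficients printed by glm_model.coef()
coef = {
    "Intercept": 27.87692087078464,
    "Daily Time Spent on Site": -0.1966950113270224,
    "Age": 0.14699298814289136,
    "Area Income": -0.00013020677570598898,
    "Daily Internet Usage": -0.06268513367372519,
    "Male": -0.5392970117985854,
}

# First row of the data set (raw, non-standardized values)
row = {
    "Daily Time Spent on Site": 68.95,
    "Age": 35,
    "Area Income": 61833.9,
    "Daily Internet Usage": 256.09,
    "Male": 0,
}

def predict_p1(coef, row):
    """Probability of class 1: logistic function of the linear predictor."""
    eta = coef["Intercept"] + sum(coef[name] * value for name, value in row.items())
    return 1.0 / (1.0 + math.exp(-eta))

p1 = predict_p1(coef, row)
print(round(p1, 4))
```

The probability comes out close to zero, in line with the observed label 0 for this row.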


In [16]:
# Coefficients fitted on the standardized data (requires standardize = True, which is on by default)
print(glm_model.coef_norm())

{'Intercept': 1.6886784338118903, 'Daily Time Spent on Site': -3.1568417772378043, 'Age': 1.2706869405802226, 'Area Income': -1.7506282774499542, 'Daily Internet Usage': -2.710143301946955, 'Male': -0.2694465303066473}
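
The two sets of coefficients are linked by the scale of each predictor. Assuming the usual relation (standardized coefficient = raw coefficient × standard deviation of the predictor in the training frame), the ratio of the two recovers that standard deviation. This is a consistency check in plain Python, not an H2O call.

```python
# Raw and standardized coefficients copied from the two outputs (intercept omitted)
coef = {
    "Daily Time Spent on Site": -0.1966950113270224,
    "Age": 0.14699298814289136,
    "Area Income": -0.00013020677570598898,
    "Daily Internet Usage": -0.06268513367372519,
    "Male": -0.5392970117985854,
}
coef_norm = {
    "Daily Time Spent on Site": -3.1568417772378043,
    "Age": 1.2706869405802226,
    "Area Income": -1.7506282774499542,
    "Daily Internet Usage": -2.710143301946955,
    "Male": -0.2694465303066473,
}

# Under the assumed relation, this ratio is the training-set standard deviation
implied_sigma = {name: coef_norm[name] / coef[name] for name in coef}

for name, sigma in implied_sigma.items():
    print(f"{name}: {sigma:.4f}")
```

The implied values are close to the sigma row of data.describe() (computed on all 1000 rows), which supports the assumption. The standardized coefficients also show that Daily Time Spent on Site and Daily Internet Usage have the largest effects, something the raw coefficients hide.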


In [17]:
# Print the Coefficients table
glm_model._model_json['output']['coefficients_table']

Coefficients: glm coefficients



names,coefficients,std_error,z_value,p_value,standardized_coefficients
Intercept,27.8769209,3.0722314,9.0738349,0.0,1.6886784
Daily Time Spent on Site,-0.1966950,0.0227155,-8.6590663,0.0,-3.1568418
Age,0.1469930,0.0272385,5.3965219,0.0000001,1.2706869
Area Income,-0.0001302,0.0000201,-6.4704071,0.0000000,-1.7506283
Daily Internet Usage,-0.0626851,0.0076578,-8.1858094,0.0000000,-2.7101433
Male,-0.5392970,0.4556241,-1.1836445,0.2365538,-0.2694465
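
The z_value and p_value columns can be reproduced from the coefficient and its standard error. The sketch below assumes a two-sided Wald test with a standard normal reference distribution; the helper `wald_z_and_p` is ours, and the Male row of the table is consistent with this assumption.

```python
import math

def wald_z_and_p(coefficient, std_error):
    """Wald z statistic and two-sided p-value under a standard normal."""
    z = coefficient / std_error
    # P(|Z| > |z|) for Z ~ N(0, 1), written with the complementary error function
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p

# Coefficient and standard error of the Male predictor
z, p = wald_z_and_p(-0.5392970117985854, 0.45562413185764894)
print(round(z, 4), round(p, 4))
```

Male is the only predictor whose p-value is far above the usual 0.05 level, so the data gives no clear evidence that it affects the click probability.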




In [18]:
# Print the standard errors of the coefficients
print(glm_model._model_json['output']['coefficients_table']['std_error'])


[3.0722314292465556, 0.02271549899985897, 0.02723846770856031, 2.0123428634649435e-05, 0.007657780798422204, 0.45562413185764894]


In [21]:
predictions = glm_model.predict(test)
                               
predictions.head(5)

glm prediction progress: |████████████████████████████████████████████████| 100%


predict,p0,p1,StdErr
0,0.996067,0.00393345,0.612624
0,0.90338,0.0966205,0.478406
1,1.30426e-05,0.999987,1.26424
0,0.964354,0.0356458,0.506453
1,3.03995e-05,0.99997,1.1998
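
The predict column is obtained by thresholding p1. The sketch below reproduces the labels of the five rows above. The threshold is an assumption: we use the max-F1 threshold reported on the test set further down, whereas the threshold H2O actually applied may differ.

```python
# Class probabilities of the five rows shown above
p0 = [0.996067, 0.90338, 1.30426e-05, 0.964354, 3.03995e-05]
p1 = [0.00393345, 0.0966205, 0.999987, 0.0356458, 0.99997]

# Assumption: max-F1 threshold from the test metrics, used here for illustration only
threshold = 0.2097812

# Predict class 1 whenever the probability of a click reaches the threshold
labels = [1 if p >= threshold else 0 for p in p1]
print(labels)  # [0, 0, 1, 0, 1]

# The two class probabilities of each row sum to 1 (up to display rounding)
print(all(abs(a + b - 1.0) < 1e-5 for a, b in zip(p0, p1)))  # True
```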




In [25]:
# Evaluate the model on the held-out test set (AUC, confusion matrix, etc.)
perf = glm_model.model_performance(test)
print(perf)


ModelMetricsBinomialGLM: glm
** Reported on test data. **

MSE: 0.02369804444066122
RMSE: 0.1539416916909166
LogLoss: 0.09510143545409801
Null degrees of freedom: 197
Residual degrees of freedom: 192
Null deviance: 274.48628350173834
Residual deviance: 37.66016843982282
AIC: 49.66016843982282
AUC: 0.9907152331394756
pr_auc: 0.9832395618791827
Gini: 0.9814304662789513
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.20978118730958129: 


,0.0,1.0,Error,Rate
0,97.0,2.0,0.0202,(2.0/99.0)
1,3.0,96.0,0.0303,(3.0/99.0)
Total,100.0,98.0,0.0253,(5.0/198.0)


Maximum Metrics: Maximum metrics at their respective thresholds



metric,threshold,value,idx
max f1,0.2097812,0.9746193,97.0
max f2,0.2097812,0.9716599,97.0
max f0point5,0.6888539,0.9894737,93.0
max accuracy,0.6888539,0.9747475,93.0
max precision,0.9999982,1.0,0.0
max recall,0.0058866,1.0,158.0
max specificity,0.9999982,1.0,0.0
max absolute_mcc,0.6888539,0.9507082,93.0
max min_per_class_accuracy,0.2097812,0.9696970,97.0


Gains/Lift Table: Avg response rate: 50.00 %, avg score: 78.48 %



,group,cumulative_data_fraction,lower_threshold,lift,cumulative_lift,response_rate,score,cumulative_response_rate,cumulative_score,capture_rate,cumulative_capture_rate,gain,cumulative_gain
,1,0.0101010,1.4512117,2.0,2.0,1.0,1.4767788,1.0,1.4767788,0.0202020,0.0202020,100.0,100.0
,2,0.0202020,1.3899024,2.0,2.0,1.0,1.4441281,1.0,1.4604534,0.0202020,0.0404040,100.0,100.0
,3,0.0303030,1.3512287,2.0,2.0,1.0,1.3759319,1.0,1.4322796,0.0202020,0.0606061,100.0,100.0
,4,0.0404040,1.3195998,2.0,2.0,1.0,1.3402391,1.0,1.4092695,0.0202020,0.0808081,100.0,100.0
,5,0.0505051,1.3039604,2.0,2.0,1.0,1.3157720,1.0,1.3905700,0.0202020,0.1010101,100.0,100.0
,6,0.1010101,1.2121425,2.0,2.0,1.0,1.2574698,1.0,1.3240199,0.1010101,0.2020202,100.0,100.0
,7,0.1515152,1.1723987,2.0,2.0,1.0,1.1924020,1.0,1.2801472,0.1010101,0.3030303,100.0,100.0
,8,0.2020202,1.1090618,2.0,2.0,1.0,1.1412939,1.0,1.2454339,0.1010101,0.4040404,100.0,100.0
,9,0.3030303,0.9654628,2.0,2.0,1.0,1.0382507,1.0,1.1763728,0.2020202,0.6060606,100.0,100.0
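
Several of the summary metrics printed at the top of this report are tied together by simple identities, which gives a quick sanity check. The sketch below uses only the printed numbers; the parameter count of 6 (intercept plus five predictors) is inferred from the degrees of freedom (197 - 192 + 1).

```python
import math

# Values copied from the performance report
mse = 0.02369804444066122
residual_deviance = 37.66016843982282
auc = 0.9907152331394756
n_rows = 198    # rows in the test frame
n_params = 6    # intercept + 5 predictors (inferred from the degrees of freedom)

rmse = math.sqrt(mse)                        # RMSE is the square root of MSE
logloss = residual_deviance / (2 * n_rows)   # binomial deviance = 2 * N * LogLoss
aic = residual_deviance + 2 * n_params       # AIC = deviance + 2 * number of parameters
gini = 2 * auc - 1                           # Gini = 2 * AUC - 1

print(rmse, logloss, aic, gini)
```

All four values agree with the RMSE, LogLoss, AIC and Gini lines of the report.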






# Interpretation

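The model performs well on the held-out test set: the AUC is 0.99 and the maximum F1 is 0.97, reached at a threshold of about 0.21. At that threshold the confusion matrix gives a precision of 96/98 ≈ 0.98 and a recall of 96/99 ≈ 0.97. The precision and recall values of 1.0 in the maximum-metrics table are each reached at a different, extreme threshold, so they do not describe a single operating point. In the coefficients table, Male is the only predictor that is not significant (p ≈ 0.24).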
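
The metrics at the max-F1 threshold can be recomputed directly from the confusion matrix printed above. This is a plain-Python check, not an H2O call.

```python
# Confusion matrix at the max-F1 threshold (rows: actual, columns: predicted)
tn, fp = 97, 2   # actual 0
fn, tp = 3, 96   # actual 1

precision = tp / (tp + fp)   # share of predicted clicks that were real clicks
recall = tp / (tp + fn)      # share of real clicks that were found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(round(precision, 4), round(recall, 4), round(f1, 4), round(accuracy, 4))
```

The F1 value matches the max f1 entry of the maximum-metrics table (0.9746193), and the accuracy corresponds to the overall error rate of 5/198 in the confusion matrix.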