# AutoML using H2O
## Tabular Playground Series - Jan 2021

### Description:
In this notebook we are going to use H2O's AutoML. It is one of the most widely used AutoML libraries and is known for giving very good results. For the sake of demonstration I limit the search to 3 models, but you can always experiment with it and train for a longer duration.

This notebook is inspired by various tutorials and kernels that have used H2O's AutoML to secure good ranks. Personally, I found the results quite satisfactory considering the small amount of work and time I had to spend to achieve that score.

## IMPORTING DEPENDENCIES

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

/kaggle/input/tabular-playground-series-jan-2021/sample_submission.csv
/kaggle/input/tabular-playground-series-jan-2021/train.csv
/kaggle/input/tabular-playground-series-jan-2021/test.csv


In [2]:
import h2o
from h2o.automl import H2OAutoML
h2o.init()

Checking whether there is an H2O instance running at http://localhost:54321 ..... not found.
Attempting to start a local H2O server...
  Java Version: openjdk version "11.0.9.1" 2020-11-04; OpenJDK Runtime Environment (build 11.0.9.1+1-Ubuntu-0ubuntu1.18.04); OpenJDK 64-Bit Server VM (build 11.0.9.1+1-Ubuntu-0ubuntu1.18.04, mixed mode, sharing)
  Starting server from /opt/conda/lib/python3.7/site-packages/h2o/backend/bin/h2o.jar
  Ice root: /tmp/tmpum91n_3y
  JVM stdout: /tmp/tmpum91n_3y/h2o_unknownUser_started_from_python.out
  JVM stderr: /tmp/tmpum91n_3y/h2o_unknownUser_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321 ... successful.


H2O_cluster_uptime: 04 secs
H2O_cluster_timezone: Etc/UTC
H2O_data_parsing_timezone: UTC
H2O_cluster_version: 3.32.0.2
H2O_cluster_version_age: 1 month and 23 days
H2O_cluster_name: H2O_from_python_unknownUser_ol5drh
H2O_cluster_total_nodes: 1
H2O_cluster_free_memory: 3.250 Gb
H2O_cluster_total_cores: 2
H2O_cluster_allowed_cores: 2


## IMPORTING DATASET

### H2O has its own way of handling datasets (the H2OFrame), so we import the files with h2o.import_file rather than reading them with pandas.

In [3]:
train = h2o.import_file('/kaggle/input/tabular-playground-series-jan-2021/train.csv')
test = h2o.import_file('/kaggle/input/tabular-playground-series-jan-2021/test.csv')

Parse progress: |█████████████████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%


In [4]:
# Also read the CSV files with pandas; test_df['id'] is needed later for the submission file.

train_df = pd.read_csv('/kaggle/input/tabular-playground-series-jan-2021/train.csv')
test_df = pd.read_csv('/kaggle/input/tabular-playground-series-jan-2021/test.csv')

In [5]:
train.describe()

Rows:300000
Cols:16




Unnamed: 0,id,cont1,cont2,cont3,cont4,cont5,cont6,cont7,cont8,cont9,cont10,cont11,cont12,cont13,cont14,target
type,int,real,real,real,real,real,real,real,real,real,real,real,real,real,real,real
mins,1.0,-0.08226332148023098,-0.03139747284868896,0.020966867357024715,0.15276142274357513,0.2763766684547749,0.06616556355640804,-0.0976663005972248,0.2172599973404939,-0.24060419174975897,-0.08504600359700187,0.08327673657427467,0.08863482103728729,0.029950236962791588,0.16636741070307134,0.0
mean,249825.1458566666,0.5068728581831755,0.49789800448055244,0.5215572703508549,0.5156828403788997,0.5020220135884831,0.5265152304818906,0.4878900924194839,0.52516340241279,0.4598574065006892,0.5205322691134051,0.48392640188416003,0.506876563113431,0.5534416142060705,0.503712930924629,7.9056613283168335
maxs,499999.0,1.0162274167302328,0.8596967694315312,1.006954603242489,1.010402194425765,1.0342608913385214,1.0438577299007883,1.0661674751074297,1.0244272333729485,1.0041140988637949,1.1999513922566574,1.0226201415878613,1.0490254841877338,0.9778450539552797,0.8685064129198011,10.267568500800396
sigma,144476.73256229569,0.20397619377641646,0.2281594531471517,0.20077005864001526,0.23303548066745436,0.22070118123545235,0.21790897941119108,0.18109605419380245,0.21622147432174998,0.19668460399631907,0.20185419152962933,0.22008244024189094,0.21894739994721182,0.2297303024870577,0.20823755996298202,0.733070830366318
zeros,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
missing,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,1.0,0.6703898514390889,0.8112995057309422,0.6439683093331412,0.2917913764510022,0.28411737646993296,0.8559531758452059,0.8907004183744506,0.2855421109796029,0.5582454418515085,0.7794183626907151,0.9218320519913782,0.8667720988813201,0.8787327721946618,0.3054113450701753,7.243042589449295
1,3.0,0.3880525276975261,0.6211042271574185,0.6861020924830562,0.5011490796546958,0.6437895146086654,0.449804950718596,0.5108237501974809,0.5807482361435341,0.4183350707888616,0.4326316993235111,0.4398722862927277,0.4349705690738134,0.3699574333716138,0.3694841682508747,8.203331138256422
2,4.0,0.8349504778390991,0.2274363757909521,0.3015838588756856,0.293408406815278,0.6068394934817684,0.8291750847270303,0.5061434957864068,0.5587710129561313,0.5876031455771054,0.8233116378351174,0.5670066799940074,0.6777078828596778,0.8829380804527522,0.3030471034878757,7.776090759821726


In [6]:
# Prepare the data: y is the target column, x is every column except the target and id.

y = 'target'
x = train.columns
x.remove(y)
x.remove('id')
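The same column selection can also be written as a single expression without in-place `remove` calls. The sketch below is plain Python for illustration only: `select_features` is a hypothetical helper name (not part of H2O), and it assumes the column names are available as a list of strings, as `train.columns` provides.

```python
def select_features(columns, target, drop=("id",)):
    """Return predictor names: every column except the target and the dropped ones."""
    excluded = {target, *drop}
    missing = excluded - set(columns)
    if missing:
        # Fail early if the target or a dropped column is misspelled
        raise KeyError(f"columns not found: {sorted(missing)}")
    return [c for c in columns if c not in excluded]


# Column layout of train.csv as shown by train.describe() above
columns = ["id"] + [f"cont{i}" for i in range(1, 15)] + ["target"]
x = select_features(columns, target="target")
print(len(x), x[0], x[-1])  # 14 cont1 cont14
```

In the notebook this would be called as `x = select_features(train.columns, y)`.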

In [7]:
# max_models and seed can both be changed. The larger max_models is, the longer the search takes.
# The best part is that AutoML also tries out ensemble models on top of the base models.

aml = H2OAutoML(max_models = 3, seed = 1)
aml.train(x = x, y = y, training_frame = train)

AutoML progress: |████████████████████████████████████████████████████████| 100%
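To run a longer search than the 3-model demo, the settings can be collected in one place before `H2OAutoML` is constructed. This is a rough sketch, not the author's setup: `automl_settings` and `run_automl` are hypothetical helpers. The keyword names `max_runtime_secs` and `sort_metric` are, to my knowledge, accepted by `H2OAutoML` alongside `max_models` and `seed`, but check the documentation of your installed version.

```python
def automl_settings(max_models=3, max_runtime_secs=None, seed=1, sort_metric="RMSE"):
    """Collect keyword arguments for H2OAutoML; None means 'leave at H2O's default'."""
    settings = {
        "max_models": max_models,
        "max_runtime_secs": max_runtime_secs,
        "seed": seed,
        "sort_metric": sort_metric,
    }
    return {k: v for k, v in settings.items() if v is not None}


def run_automl(train, x, y, **settings):
    """Train AutoML on an H2OFrame. h2o is imported lazily, so the helper above works without it."""
    from h2o.automl import H2OAutoML
    aml = H2OAutoML(**settings)
    aml.train(x=x, y=y, training_frame=train)
    return aml


quick = automl_settings()  # the 3-model demo used in this notebook
longer = automl_settings(max_models=20, max_runtime_secs=3600)  # illustrative values
print(quick)
print(longer)
```

With a running H2O cluster, `aml = run_automl(train, x, y, **longer)` would start the longer search.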


In [8]:
# H2O stores the trained models in a leaderboard table that lists each model's name and metrics such as rmse, mse, mae and more.
lb = aml.leaderboard

In [9]:
# Let's have a look at some of the rows in the table.
lb.head()

model_id,mean_residual_deviance,rmse,mse,mae,rmsle
XGBoost_3_AutoML_20210110_122336,0.494018,0.702864,0.494018,0.588499,0.0799142
StackedEnsemble_AllModels_AutoML_20210110_122336,0.511833,0.715425,0.511833,0.605068,0.0813321
XGBoost_1_AutoML_20210110_122336,0.513151,0.716346,0.513151,0.594672,0.0813992
XGBoost_2_AutoML_20210110_122336,0.579034,0.760943,0.579034,0.622973,0.0863801
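The metric columns are related: rmse is the square root of mse, and in this table mean_residual_deviance matches mse for every model. The short check below copies the mse and rmse values from the leaderboard above and confirms the relationship. It is a plain-Python illustration and does not call H2O.

```python
import math

# (model_id, mse, rmse) copied from the leaderboard above
leaderboard = [
    ("XGBoost_3_AutoML_20210110_122336", 0.494018, 0.702864),
    ("StackedEnsemble_AllModels_AutoML_20210110_122336", 0.511833, 0.715425),
    ("XGBoost_1_AutoML_20210110_122336", 0.513151, 0.716346),
    ("XGBoost_2_AutoML_20210110_122336", 0.579034, 0.760943),
]

for model_id, mse, rmse in leaderboard:
    # RMSE is the square root of MSE, so both columns rank the models identically
    assert math.isclose(math.sqrt(mse), rmse, abs_tol=1e-5)

# The leader is the model with the lowest rmse
best = min(leaderboard, key=lambda row: row[2])
print(best[0])  # XGBoost_3_AutoML_20210110_122336
```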




In [10]:
# To view all the models and their scores, pass rows=lb.nrows to head().
lb.head(rows=lb.nrows)

model_id,mean_residual_deviance,rmse,mse,mae,rmsle
XGBoost_3_AutoML_20210110_122336,0.494018,0.702864,0.494018,0.588499,0.0799142
StackedEnsemble_AllModels_AutoML_20210110_122336,0.511833,0.715425,0.511833,0.605068,0.0813321
XGBoost_1_AutoML_20210110_122336,0.513151,0.716346,0.513151,0.594672,0.0813992
XGBoost_2_AutoML_20210110_122336,0.579034,0.760943,0.579034,0.622973,0.0863801




In [11]:
# Choose the best model, which is the first row of the leaderboard, as our model.
model = aml.leader

In [12]:
# Use the leader model to predict on the test dataset. Note that we use the test frame imported into H2O, not the pandas DataFrame.
preds = model.predict(test)

xgboost prediction progress: |████████████████████████████████████████████| 100%


In [13]:
# Convert the predictions to a pandas DataFrame using the as_list function, then create our final submission file.
final = h2o.as_list(preds)
final['predict']

0         7.987298
1         7.877398
2         7.936702
3         8.192446
4         8.215229
            ...   
199995    8.295063
199996    8.165467
199997    8.085840
199998    7.994658
199999    7.970603
Name: predict, Length: 200000, dtype: float64

In [14]:
# Build the submission file: ids from the pandas test set, predictions from the leader model.
sub = pd.DataFrame()
sub['id'] = test_df['id']
sub['target'] = final['predict']
sub.to_csv('submission.csv', index=False)
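Before uploading, it can help to confirm that the file has the expected header and exactly one row per test id. The sketch below uses only the standard library. `check_submission` is a hypothetical helper, and the three-row demo file with made-up ids and values stands in for the real `submission.csv`.

```python
import csv
import os
import tempfile


def check_submission(path, expected_ids):
    """Return True if the CSV has columns id,target and exactly the expected ids, in order."""
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        if reader.fieldnames != ["id", "target"]:
            return False
        rows = list(reader)
    try:
        ids = [int(r["id"]) for r in rows]
        for r in rows:
            float(r["target"])  # every prediction must parse as a number
    except (TypeError, ValueError):
        return False
    return ids == list(expected_ids)


# Demo with a stand-in file (ids and values are illustrative, not real predictions)
demo_path = os.path.join(tempfile.mkdtemp(), "submission.csv")
with open(demo_path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "target"])
    writer.writerows([[0, 7.9], [2, 8.1], [5, 8.0]])

print(check_submission(demo_path, [0, 2, 5]))  # True
print(check_submission(demo_path, [0, 2]))     # False
```

In the notebook the call would be `check_submission('submission.csv', test_df['id'].tolist())`.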

In [15]:
# If you are reading this, thanks for dropping by. Please upvote if you find it useful.