<a href="https://www.kaggle.com/code/taimour/s4e9-h2o-explained-fast-quick?scriptVersionId=198467869" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Cars need some H2O 🌊 Let's begin!

![](https://i.ibb.co/LCqzXRj/pexels-koprivakart-3354648.jpg)

# Import Libraries

In [1]:
import pandas as pd

This imports the Pandas library and assigns it the alias pd. Pandas is a powerful Python library used for data manipulation and analysis, especially for handling tabular data structures like DataFrames.

In [2]:
import h2o

This imports the H2O library. H2O is an open-source machine learning platform that provides scalable and fast algorithms for building predictive models. It supports various supervised and unsupervised learning tasks.

In [3]:
from h2o.automl import H2OAutoML

This imports the H2OAutoML class from the H2O AutoML module. H2OAutoML is an automated machine learning system that automatically trains and tunes multiple models, ranks them, and selects the best one based on performance. It significantly reduces the need for manual intervention in the model training process.

# Read Data

In [4]:
# Read the competition data
train = pd.read_csv('/kaggle/input/playground-series-s4e9/train.csv')
test  = pd.read_csv('/kaggle/input/playground-series-s4e9/test.csv')
sub = pd.read_csv('/kaggle/input/playground-series-s4e9/sample_submission.csv')

# Drop the id column (an identifier, not a feature)
train.drop(columns=['id'], inplace=True)
test.drop(columns=['id'], inplace=True)

# Initialize H2O

In [5]:
h2o.init()

Checking whether there is an H2O instance running at http://localhost:54321..... not found.
Attempting to start a local H2O server...
  Java Version: openjdk version "11.0.24" 2024-07-16; OpenJDK Runtime Environment (build 11.0.24+8-post-Ubuntu-1ubuntu320.04); OpenJDK 64-Bit Server VM (build 11.0.24+8-post-Ubuntu-1ubuntu320.04, mixed mode, sharing)
  Starting server from /opt/conda/lib/python3.10/site-packages/h2o/backend/bin/h2o.jar
  Ice root: /tmp/tmplo1711d8
  JVM stdout: /tmp/tmplo1711d8/h2o_unknownUser_started_from_python.out
  JVM stderr: /tmp/tmplo1711d8/h2o_unknownUser_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321 ... successful.


H2O_cluster_uptime: 03 secs
H2O_cluster_timezone: Etc/UTC
H2O_data_parsing_timezone: UTC
H2O_cluster_version: 3.46.0.4
H2O_cluster_version_age: 2 months and 17 days
H2O_cluster_name: H2O_from_python_unknownUser_dr0h1o
H2O_cluster_total_nodes: 1
H2O_cluster_free_memory: 7.500 Gb
H2O_cluster_total_cores: 4
H2O_cluster_allowed_cores: 4


**What It Does:** The line `h2o.init()` does the following:

**Starts an H2O Cluster:**
It first checks for an H2O instance already running at the default address
(http://localhost:54321) and connects to it if one is found. Otherwise, as in the log
above, it starts a new local H2O server and connects to that.

**Specifies Resources:**
By default, it allocates a share of your machine's memory and CPU cores to the H2O
cluster (here, 7.5 GB of free memory and 4 cores). You can control these resources by
passing arguments like `max_mem_size` or `nthreads`.

**Checks for Java:**
H2O runs on the Java Virtual Machine, so it verifies that Java is installed and
properly configured in your environment.

**Prints Connection Details:**
Once the H2O instance is up, it prints connection details such as the server address
and port, the H2O version, and the cluster's memory and cores. This helps verify that
the H2O cluster is ready for use.
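
As a minimal sketch of the resource arguments mentioned above, the helper below builds the keyword arguments for `h2o.init()`. The function names `init_kwargs` and `start_h2o` and the 4 GB default are illustrative assumptions, not settings used in this notebook. `h2o` is imported lazily so the helper can be inspected without a running cluster.

```python
def init_kwargs(mem_gb: int, threads: int = -1) -> dict:
    """Build keyword arguments for h2o.init() (hypothetical helper)."""
    if mem_gb <= 0:
        raise ValueError("mem_gb must be positive")
    # max_mem_size takes a string such as "4G"; nthreads=-1 means all available cores
    return {"max_mem_size": f"{mem_gb}G", "nthreads": threads}


def start_h2o(mem_gb: int = 4, threads: int = -1) -> None:
    """Start (or connect to) a local H2O cluster with explicit resource limits."""
    import h2o  # lazy import: requires the h2o package and Java

    h2o.init(**init_kwargs(mem_gb, threads))
```

Calling `start_h2o(mem_gb=4, threads=2)` would be equivalent to `h2o.init(max_mem_size="4G", nthreads=2)`.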

# Training

In [6]:
train_data = h2o.H2OFrame(train)

Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%


**What It Does:**
The line `train_data = h2o.H2OFrame(train)` converts the Pandas DataFrame `train` into an H2OFrame, the primary data structure used by H2O.

**Converts Data:**
`train` is the Pandas DataFrame loaded earlier with `pd.read_csv`. This line uploads it to the H2O cluster and parses it into an H2OFrame, the native format that H2O's machine learning algorithms require.

**Allows H2O Processing:**
H2O algorithms (such as AutoML) need the data in this format for training, validation, and prediction.

**Scalability:**
An H2OFrame looks similar to a Pandas DataFrame, but its data lives in the H2O cluster rather than in the Python process, so it is designed for distributed computing on large datasets.
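
One thing worth checking after the conversion is how H2O typed each column, since `H2OFrame.types` may report free-text columns as `string` rather than categorical `enum`. The sketch below shows one way to find such columns and convert them with `asfactor()`. The helper names `string_columns` and `strings_to_factors` are hypothetical, and whether the conversion helps depends on the data, so treat it as something to verify rather than a required step.

```python
def string_columns(types: dict) -> list:
    """Return the names of columns that H2O parsed as free-text 'string'."""
    return [name for name, kind in types.items() if kind == "string"]


def strings_to_factors(frame):
    """Convert every 'string' column of an H2OFrame to a categorical column.

    `frame` is assumed to be an H2OFrame; `frame.types` maps column names
    to H2O type names such as 'int', 'real', 'enum', or 'string'.
    """
    for name in string_columns(frame.types):
        frame[name] = frame[name].asfactor()
    return frame
```

For example, `train_data = strings_to_factors(train_data)` could be run right after the conversion, and `train_data.types` inspected before and after.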

In [7]:
aml = H2OAutoML(max_runtime_secs=7200, seed=5)

**What It Does:**
The line `aml = H2OAutoML(max_runtime_secs=7200, seed=5)` initializes an H2OAutoML object with specific parameters to control the behavior of the automated machine learning process.

**Creates an H2OAutoML Object:**
H2OAutoML automates the machine learning workflow, including training and tuning a variety of models, performing cross-validation, and selecting the best model based on performance.

**Parameters:**
**max_runtime_secs=7200:** This specifies the maximum amount of time (in seconds) that AutoML will be allowed to run. Here, 7200 seconds equals 2 hours. AutoML will try various models and hyperparameters during this time limit.

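**seed=5:** This sets the random seed to make runs more repeatable. A seed alone does not guarantee identical results here: with a time limit (`max_runtime_secs`), the number of models trained depends on the hardware and its load, and H2O's Deep Learning models are not reproducible by default. For repeatable runs, the H2O documentation advises limiting the run with `max_models` instead of a time budget.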
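
Since the time budget is given in seconds, it can help to derive it from hours rather than hard-coding 7200. The sketch below also shows an alternative configuration capped by `max_models`, which the H2O documentation describes as the more repeatable option. The names `automl_settings` and `build_automl` and the value of 20 models are illustrative assumptions.

```python
def automl_settings(hours: float = 2.0, max_models=None, seed: int = 5) -> dict:
    """Return keyword arguments for H2OAutoML (hypothetical helper)."""
    settings = {"seed": seed}
    if max_models is not None:
        # Cap by model count and skip Deep Learning for more repeatable runs
        settings["max_models"] = max_models
        settings["exclude_algos"] = ["DeepLearning"]
    else:
        # Cap by wall-clock time, as the notebook does
        settings["max_runtime_secs"] = int(hours * 3600)
    return settings


def build_automl(**kwargs):
    """Create an H2OAutoML object from the settings above."""
    from h2o.automl import H2OAutoML  # lazy import: requires the h2o package

    return H2OAutoML(**automl_settings(**kwargs))
```

`build_automl()` reproduces the notebook's configuration, while `build_automl(max_models=20)` trades the time limit for a fixed number of models.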

In [8]:
aml.train(y='price', training_frame=train_data)

AutoML progress: |███████████████████████████████████████████████████████████████| (done) 100%


key,value
Stacking strategy,cross_validation
Number of base models (used / total),18/100
# GBM base models (used / total),9/48
# XGBoost base models (used / total),6/45
# DeepLearning base models (used / total),2/4
# DRF base models (used / total),1/2
# GLM base models (used / total),0/1
Metalearner algorithm,GLM
Metalearner fold assignment scheme,Random
Metalearner nfolds,5

metric,mean,sd,cv_1_valid,cv_2_valid,cv_3_valid,cv_4_valid,cv_5_valid
aic,950605.56,6745.6006,955459.7,949115.25,949596.8,940697.7,958158.4
loglikelihood,0.0,0.0,0.0,0.0,0.0,0.0,0.0
mae,19179.672,163.26158,19191.277,19188.055,19217.787,18923.385,19377.854
mean_residual_deviance,5268907000.0,1003293570.0,5336161300.0,4800975900.0,4418981900.0,4822168600.0,6966246400.0
mse,5268907000.0,1003293570.0,5336161300.0,4800975900.0,4418981900.0,4822168600.0,6966246400.0
null_deviance,234252986000000.0,37092495000000.0,236841593000000.0,218761760000000.0,203139320000000.0,215518322000000.0,297003968000000.0
r2,0.1549615,0.0229434,0.1469948,0.1711444,0.1753038,0.1625759,0.1187888
residual_deviance,198613012000000.0,37379620000000.0,202027058000000.0,181318453000000.0,167528035000000.0,180469660000000.0,261721885000000.0
rmse,72343.88,6639.7993,73049.03,69289.08,66475.42,69441.836,83464.05
rmsle,0.533722,0.001438,0.5351196,0.532972,0.5346053,0.5343354,0.5315775


**What It Does:**
The line `aml.train(y='price', training_frame=train_data)` starts the H2O AutoML training process on the dataset stored in `train_data`. Here’s a breakdown of what happens:

**Trains the AutoML Models:**
AutoML trains and tunes a series of models within the time limit, then builds stacked ensembles from them. The summary above describes the leading ensemble: it uses 18 of the 100 base models, and its metrics are reported across 5 cross-validation folds.

**Specifies the Target (Dependent) Variable:**
`y='price'`: The name of the target column to predict, here the price of a used car. Because no `x` argument is given, every other column in `train_data` is used as a predictor.

**Provides the Training Data:**
`training_frame=train_data`: The dataset on which AutoML trains its models. It must be an H2OFrame, which is why `train` was converted with `h2o.H2OFrame(train)`.
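
The cross-validation table printed by the training cell reports `mae`, `rmse`, and `rmsle`. As a rough guide to reading it, the sketch below implements the textbook definitions of these three metrics in plain Python. The sample prices are made up for illustration, and H2O's internal implementation may differ in details.

```python
import math


def mae(actual, predicted):
    """Mean absolute error: the average size of the errors."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error: penalizes large errors more heavily."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def rmsle(actual, predicted):
    """Root mean squared log error: compares values on a log scale."""
    squared = [(math.log1p(a) - math.log1p(p)) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(actual))


# Made-up prices: two small errors and one very large one
actual = [10000.0, 20000.0, 300000.0]
predicted = [12000.0, 18000.0, 100000.0]

print("MAE:  ", mae(actual, predicted))
print("RMSE: ", rmse(actual, predicted))
print("RMSLE:", rmsle(actual, predicted))
```

A single large miss pulls RMSE far above MAE, which is one plausible reading of the gap between the two in the table above (an MAE near 19,000 against an RMSE near 72,000).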

# Leaderboard

In [9]:
leaderboard = aml.leaderboard
print(leaderboard)

model_id                                                   rmse          mse      mae       rmsle    mean_residual_deviance
StackedEnsemble_AllModels_3_AutoML_1_20240927_51940     72576.1  5.26729e+09  19178.3    0.533774               5.26729e+09
StackedEnsemble_AllModels_4_AutoML_1_20240927_51940     72585.2  5.26861e+09  19181.1    0.533744               5.26861e+09
StackedEnsemble_BestOfFamily_4_AutoML_1_20240927_51940  72601.3  5.27094e+09  19232      0.537721               5.27094e+09
StackedEnsemble_BestOfFamily_5_AutoML_1_20240927_51940  72615.3  5.27298e+09  19136.3  nan                      5.27298e+09
StackedEnsemble_BestOfFamily_3_AutoML_1_20240927_51940  72655.6  5.27883e+09  19309.3    0.535149               5.27883e+09
StackedEnsemble_AllModels_2_AutoML_1_20240927_51940     72657.2  5.27907e+09  19309.6    0.53571                5.27907e+09
DeepLearning_grid_1_AutoML_1_20240927_51940_model_1     72789.5  5.29831e+09  19175.6  nan                      5.29831e+09
DeepLear

**What It Does:**
The line `leaderboard = aml.leaderboard` retrieves the leaderboard from the H2O AutoML object (`aml`). The leaderboard is a ranked list of the models that were trained during the AutoML process.

**Retrieves the Leaderboard:**
The leaderboard lists the models from best to worst according to a default metric (e.g., RMSE for regression tasks or AUC for classification tasks). Because no separate leaderboard frame was supplied, the scores come from cross-validation on the training data. For error metrics such as RMSE, lower is better, so the model with the smallest RMSE appears first.

**Assigns to a Variable:**
Assigning `aml.leaderboard` to `leaderboard` saves the ranked list as an H2OFrame that you can inspect or analyze further.
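
The ranking depends on which metric is used. The sketch below re-ranks the rows printed above in plain Python, using the RMSE and MAE values from that output. The model ids are shortened for readability, and the `rank` helper is hypothetical.

```python
# (model id, rmse, mae) copied from the leaderboard output; ids shortened
rows = [
    ("StackedEnsemble_AllModels_3", 72576.1, 19178.3),
    ("StackedEnsemble_AllModels_4", 72585.2, 19181.1),
    ("StackedEnsemble_BestOfFamily_4", 72601.3, 19232.0),
    ("StackedEnsemble_BestOfFamily_5", 72615.3, 19136.3),
    ("StackedEnsemble_BestOfFamily_3", 72655.6, 19309.3),
    ("StackedEnsemble_AllModels_2", 72657.2, 19309.6),
    ("DeepLearning_grid_1_model_1", 72789.5, 19175.6),
]

METRIC_INDEX = {"rmse": 1, "mae": 2}


def rank(rows, metric):
    """Return model ids sorted from best (lowest error) to worst."""
    index = METRIC_INDEX[metric]
    return [row[0] for row in sorted(rows, key=lambda row: row[index])]


print("Best by RMSE:", rank(rows, "rmse")[0])
print("Best by MAE: ", rank(rows, "mae")[0])
```

Among these rows the leader by RMSE is not the leader by MAE. With a live AutoML object, a different ranking metric can be requested up front through the `sort_metric` argument of `H2OAutoML`.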

# Best Model

In [10]:
best_model = aml.leader
print(best_model)

Model Details
H2OStackedEnsembleEstimator : Stacked Ensemble
Model Key: StackedEnsemble_AllModels_3_AutoML_1_20240927_51940


Model Summary for Stacked Ensemble: 
key                                        value
-----------------------------------------  ----------------
Stacking strategy                          cross_validation
Number of base models (used / total)       18/100
# GBM base models (used / total)           9/48
# XGBoost base models (used / total)       6/45
# DeepLearning base models (used / total)  2/4
# DRF base models (used / total)           1/2
# GLM base models (used / total)           0/1
Metalearner algorithm                      GLM
Metalearner fold assignment scheme         Random
Metalearner nfolds                         5
Metalearner fold_column
Custom metalearner hyperparameters         None

ModelMetricsRegressionGLM: stackedensemble
** Reported on train data. **

MSE: 4843709813.669395
RMSE: 69596.76582765463
MAE: 18206.228591565166
RMSLE: 0.506287280494

**What It Does:**
The line `best_model = aml.leader` assigns the best model from the H2O AutoML process to the variable `best_model`. The leader is the top-ranked model on the leaderboard, i.e. the one with the best score on the default evaluation metric (e.g., lowest RMSE for regression, highest AUC for classification).

**Selects the Best Model:**
Here the leader is `StackedEnsemble_AllModels_3_AutoML_1_20240927_51940`, a stacked ensemble that combines the predictions of its base models through a GLM metalearner. The metrics printed above are reported on the training data, so they are more optimistic than the cross-validated scores on the leaderboard.

**Assigns the Model:**
Assigning `aml.leader` to `best_model` saves the model object for further use, such as making predictions or evaluating its performance.
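
The printed metrics are labeled "Reported on train data", so it is worth comparing them with the leaderboard. The short check below uses only numbers from the outputs above: it confirms that RMSE is the square root of MSE and measures how far the leaderboard RMSE of the same model sits above the training RMSE.

```python
import math

# Values copied from the model summary ("Reported on train data")
train_mse = 4843709813.669395
train_rmse = 69596.76582765463

# RMSE of the same model on the leaderboard
leaderboard_rmse = 72576.1

# RMSE is the square root of MSE
rmse_from_mse = math.sqrt(train_mse)

# Relative gap between the leaderboard score and the training score
gap = leaderboard_rmse / train_rmse - 1

print(f"sqrt(MSE) = {rmse_from_mse:.2f}")
print(f"Leaderboard RMSE is {gap:.1%} higher than the training RMSE")
```

A gap of this kind is expected, since training metrics are computed on data the model has already seen. With a live model, `best_model.model_performance(xval=True)` should return the cross-validated metrics directly.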

# Make Predictions

In [11]:
test_data = h2o.H2OFrame(test)

Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%


**What It Does:**
The line `test_data = h2o.H2OFrame(test)` converts the test dataset, a Pandas DataFrame, into an H2OFrame, which is the data format required for making predictions with models trained in H2O.

**Converts Data:**
`test` is the Pandas DataFrame loaded earlier, with its `id` column dropped in the same way as for `train`. This line uploads it to the H2O cluster as an H2OFrame.

**Makes the Data Usable by H2O:**
Once converted, `test_data` can be passed to H2O models (such as `best_model` from H2OAutoML) for prediction and evaluation.

In [12]:
predictions = best_model.predict(test_data)

stackedensemble prediction progress: |███████████████████████████████████████████| (done) 100%




**What It Does:**
The line `predictions = best_model.predict(test_data)` uses the best model from H2O AutoML to make predictions on `test_data`. Here’s a detailed explanation of what it does:

**Uses the Best Model:**
`best_model` (retrieved via `best_model = aml.leader`) is the top-ranked model from the AutoML run. It is now used to predict the target variable for unseen data (here, `test_data`).

**Makes Predictions:**
`best_model.predict(test_data)` takes an H2OFrame and uses its feature columns to predict the target variable, `price`.

**Stores Predictions:**
The predictions are stored in the `predictions` variable, which is an H2OFrame. For this regression model it holds a column named `predict`, and each row corresponds to the row at the same position in `test_data`.

In [13]:
predictions_df = predictions.as_data_frame()




**What It Does:**
The line `predictions_df = predictions.as_data_frame()` converts the H2OFrame containing the predictions into a Pandas DataFrame, which is easier to manipulate and to combine with the submission file.

**Converts H2OFrame to Pandas DataFrame:**
`predictions` holds the model's output as an H2OFrame stored in the H2O cluster. Calling `.as_data_frame()` downloads it into the Python process as a Pandas DataFrame.

# Submit Results

In [14]:
# Save the predicted values in the price column
sub['price'] = predictions_df['predict'].values

# Save the results as a CSV file for submission
sub.to_csv('submission.csv', index=False)
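
Before uploading, a quick sanity check of the submission file can catch obvious problems. The sketch below uses only the standard library and assumes the file has a `price` column, as written by the cell above. The helper name `check_submission` and the specific checks (row count, missing values, negative prices) are illustrative choices, not competition requirements.

```python
import csv
import io


def check_submission(csv_text: str, expected_rows: int) -> list:
    """Return a list of human-readable problems found in a submission CSV."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    problems = []
    if len(rows) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(rows)}")
    for i, row in enumerate(rows):
        value = row.get("price")
        if value is None or value.strip() == "":
            problems.append(f"row {i}: missing price")
        elif float(value) < 0:
            # A regression model can output negative values; a car price cannot be negative
            problems.append(f"row {i}: negative price {value}")
    return problems


good = "id,price\n1,100.5\n2,200.0\n"
bad = "id,price\n1,-5\n2,\n"
print(check_submission(good, expected_rows=2))
print(check_submission(bad, expected_rows=2))
```

In the notebook this could be called as `check_submission(open('submission.csv').read(), expected_rows=len(test))`; an empty list means none of these checks failed.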