<a href="https://www.kaggle.com/code/taimour/cibmtr-automl-h2o-step-by-step-explained?scriptVersionId=211419671" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# <span style="background-color:#bbebfa;color:black;padding:10px;border-radius:40px;">🎒 Import Libraries</span>

In [None]:
import pandas as pd

<div style="background-color:white;color:black;padding:10px;border:5px solid #53c95b;border-radius:20px;">This imports the Pandas library and assigns it the alias pd. Pandas is a powerful Python library used for data manipulation and analysis, especially for handling tabular data structures like DataFrames.</div>

In [None]:
import h2o

<div style="background-color:white;color:black;padding:10px;border:5px solid #53c95b;border-radius:20px;">This imports the H2O library. H2O is an open-source machine learning platform that provides scalable and fast algorithms for building predictive models. It supports various supervised and unsupervised learning tasks.</div>

In [None]:
from h2o.automl import H2OAutoML

<div style="background-color:white;color:black;padding:10px;border:5px solid #53c95b;border-radius:20px;">This imports the H2OAutoML class from the H2O AutoML module. H2OAutoML is an automated machine learning system that automatically trains and tunes multiple models, ranks them, and selects the best one based on performance. It significantly reduces the need for manual intervention in the model training process.</div>

# <span style="background-color:#d4fad6;color:black;padding:10px;border-radius:40px;">⬆️ Load Data</span>

In [None]:
# Read the competition data
train = pd.read_csv('/kaggle/input/equity-post-HCT-survival-predictions/train.csv')
test  = pd.read_csv('/kaggle/input/equity-post-HCT-survival-predictions/test.csv')
sub = pd.read_csv('/kaggle/input/equity-post-HCT-survival-predictions/sample_submission.csv')

# Drop the identifier column
train.drop(columns=['ID'], inplace=True)
test.drop(columns=['ID'], inplace=True)
# Drop efs_time so that only efs remains as the target
train.drop(columns=['efs_time'], inplace=True)
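
The same cleanup can also be written without `inplace=True`, so the original frame stays available for inspection. The sketch below is only an illustration: it runs on a tiny made-up frame (the values are placeholders, not competition data) and drops both columns in a single call.

```python
import pandas as pd

# Toy stand-in for train.csv; all values are invented for illustration.
toy_train = pd.DataFrame({
    "ID": [0, 1, 2],
    "donor_age": [31.5, 44.0, 27.2],
    "efs": [1.0, 0.0, 1.0],
    "efs_time": [12.3, 48.0, 5.1],
})

# drop() returns a new DataFrame and leaves toy_train unchanged.
toy_features = toy_train.drop(columns=["ID", "efs_time"])

print(toy_features.columns.tolist())
```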

# <span style="background-color:#d4fad6;color:black;padding:10px;border-radius:40px;">🔎 View Data</span>

In [None]:
train.head()

In [None]:
test.head()

# <span style="background-color:#d4fad6;color:black;padding:10px;border-radius:40px;">🌊 Initialize H2O</span>

In [None]:
h2o.init()

<div style="background-color:white;color:black;padding:10px;border:5px solid #53c95b;border-radius:20px;">
    
**What It Does:** The line `h2o.init()` does the following:

**Starts an H2O Cluster:**
It first tries to connect to an H2O cluster that is already running at the default local 
address. If none is found, it launches a new local cluster on your machine.

**Specifies Resources:**
By default, the cluster uses all available CPU cores, and its memory limit is left to the 
Java default. You can control these resources by passing arguments such as `max_mem_size` 
or `nthreads`.

**Checks for Java:**
H2O is a Java-based machine learning library, so it verifies if Java is installed and 
properly configured in your environment.

**Prints Connection Details:**
Once the H2O instance is up, it prints out connection details like the IP address, 
port number, and version of H2O running. This helps verify that the H2O cluster is 
ready for use.</div>
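
As a rough sketch of the resource arguments mentioned above, the helper below passes `max_mem_size` and `nthreads` to `h2o.init()`. The helper name `start_h2o` and the `"4G"` default are assumptions made for illustration; pick values that fit your machine.

```python
def start_h2o(max_mem_size="4G", nthreads=-1):
    """Start (or connect to) an H2O cluster with explicit resource settings."""
    # Imported inside the function so that defining this helper does not
    # require H2O or Java; calling it does.
    import h2o

    # max_mem_size caps the memory of the H2O JVM; nthreads=-1 means "all cores".
    h2o.init(max_mem_size=max_mem_size, nthreads=nthreads)
    return h2o.cluster()


# Usage (needs Java and the h2o package installed):
# cluster = start_h2o(max_mem_size="8G", nthreads=4)
# cluster.show_status()
```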

# <span style="background-color:#d4fad6;color:black;padding:10px;border-radius:40px;">🚝 Training</span>

In [None]:
train_data = h2o.H2OFrame(train)
train_data['efs'] = train_data['efs'].asfactor()

<div style="background-color:white;color:black;padding:10px;border:5px solid #53c95b;border-radius:20px;">
    
**What It Does:**
The line `train_data = h2o.H2OFrame(train)` converts the Pandas DataFrame `train` into an H2OFrame, which is the primary data structure used by H2O.

**Converts Data:** 
An H2OFrame is similar to a Pandas DataFrame, but the data lives in the H2O cluster rather than in the Python process. This design supports distributed, scalable computing and makes it efficient for large datasets.

**Allows H2O Processing:** 
H2O algorithms (like AutoML or other machine learning models) require the data to be in this format to perform operations like training, validation, and prediction.

**Marks the Target as Categorical:** 
The line `train_data['efs'] = train_data['efs'].asfactor()` converts the `efs` column into a categorical (factor) column. Because `efs` is stored as numbers, AutoML would otherwise treat the task as regression; with a factor target it trains classification models.</div>
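
`asfactor()` turns every distinct value in a column into a factor level, so a stray value in the target would quietly become an extra class. A small Pandas-side check before the conversion can catch that. This is a sketch only: `check_binary_target` is a hypothetical helper, and the toy values are invented.

```python
import pandas as pd


def check_binary_target(frame, target):
    """Return the sorted distinct target values, or raise if they are not 0 and 1."""
    values = sorted(frame[target].dropna().unique().tolist())
    if values != [0, 1]:
        raise ValueError(f"{target!r} is not a 0/1 column: {values}")
    return values


# Invented values standing in for the real 'efs' column.
toy = pd.DataFrame({"efs": [1.0, 0.0, 1.0, 0.0]})
print(check_binary_target(toy, "efs"))
```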

In [None]:
aml = H2OAutoML(max_runtime_secs=3600,seed=5)

<div style="background-color:white;color:black;padding:10px;border:5px solid #53c95b;border-radius:20px;">
    
**What It Does:**
The line `aml = H2OAutoML(max_runtime_secs=3600, seed=5)` creates an H2OAutoML object and sets the parameters that control the automated machine learning process.

**Creates an H2OAutoML Object:**
H2OAutoML automates the machine learning workflow, including training and tuning a variety of models, performing cross-validation, and ranking the models by performance.

**Parameters:**
**max_runtime_secs=3600:** This specifies the maximum amount of time (in seconds) that AutoML is allowed to run, here one hour. AutoML trains as many models and hyperparameter settings as fit within this limit.

**seed=5:** This sets the random seed for reproducibility. A seed alone does not guarantee identical results when the run is limited by time, because the number of models that finish within `max_runtime_secs` depends on the hardware and its load. For repeatable runs, the H2O documentation recommends limiting the run with `max_models` instead.</div>
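
AutoML runs can be bounded either by time (`max_runtime_secs`) or by model count (`max_models`); the H2O documentation notes that only the model-count budget is repeatable across machines. The sketch below builds the constructor arguments for either choice. `automl_settings` is a hypothetical helper, and requiring exactly one budget is a choice made for this example (H2O itself accepts both at once).

```python
def automl_settings(seed, max_models=None, max_runtime_secs=None):
    """Build keyword arguments for H2OAutoML from exactly one stopping budget."""
    if (max_models is None) == (max_runtime_secs is None):
        raise ValueError("give exactly one of max_models or max_runtime_secs")
    settings = {"seed": seed}
    if max_models is not None:
        settings["max_models"] = max_models
    else:
        settings["max_runtime_secs"] = max_runtime_secs
    return settings


reproducible = automl_settings(seed=5, max_models=20)       # model-count budget
time_boxed = automl_settings(seed=5, max_runtime_secs=3600)  # one-hour budget
print(reproducible)
print(time_boxed)

# With H2O available, either dictionary can be unpacked into the constructor:
# from h2o.automl import H2OAutoML
# aml = H2OAutoML(**reproducible)
```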

In [None]:
aml.train(y='efs', training_frame=train_data)

<div style="background-color:white;color:black;padding:10px;border:5px solid #53c95b;border-radius:20px;">

**What It Does:**
The line `aml.train(y='efs', training_frame=train_data)` starts the H2O AutoML training process on the dataset stored in `train_data`. Here’s a breakdown of the arguments:

**Specifies the Target (Dependent) Variable:**
`y='efs'`: This is the name of the column you want to predict. In this case, `efs` is the column in `train_data` that the models learn to predict. Because no `x` argument is given, every other column in `train_data` is used as a predictor.

**Provides the Training Data:**
`training_frame=train_data`: This is the dataset on which the AutoML models are trained. `train_data` must be an H2OFrame (created with `h2o.H2OFrame(train)`).</div>
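
`aml.train()` also accepts an `x` argument listing the predictor columns; when it is omitted, all columns other than `y` are used. The sketch below shows one way to build that list explicitly. `predictor_columns` is a hypothetical helper, and `feature_a` is a placeholder column name.

```python
def predictor_columns(columns, target, ignored=()):
    """Return every column name except the target and any explicitly ignored ones."""
    excluded = {target, *ignored}
    return [name for name in columns if name not in excluded]


# Placeholder column names; in the notebook this would be train_data.columns.
columns = ["donor_age", "feature_a", "efs"]
x = predictor_columns(columns, target="efs")
print(x)

# With a trained-ready AutoML object and an H2OFrame:
# aml.train(x=x, y="efs", training_frame=train_data)
```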

# <span style="background-color:#d4fad6;color:black;padding:10px;border-radius:40px;">🎯 Leaderboard</span>

In [None]:
leaderboard = aml.leaderboard
leaderboard

<div style="background-color:white;color:black;padding:10px;border:5px solid #53c95b;border-radius:20px;">

**What It Does:**
The line `leaderboard = aml.leaderboard` retrieves the leaderboard from the H2O AutoML object (aml). The leaderboard is a ranked list of the models that were trained during the AutoML process, sorted by performance.

**Retrieves the Leaderboard:**
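The leaderboard contains all models trained by AutoML, ranked from best to worst on a default metric that depends on the task: AUC for binary classification (the case here, since `efs` is a two-level factor), mean per-class error for multiclass classification, and deviance for regression. Because no separate leaderboard frame was supplied, the scores come from cross-validation.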

**Assigns to a Variable:**
By assigning aml.leaderboard to leaderboard, you're saving the ranked list of models to the leaderboard variable, which you can then use for further inspection or analysis.</div>
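
The ranking idea can be sketched in plain Pandas. The model ids and AUC values below are made up purely for illustration; in the notebook, `aml.leaderboard.as_data_frame()` would supply the real table.

```python
import pandas as pd

# Invented leaderboard rows; real ids and scores come from the AutoML run.
toy_leaderboard = pd.DataFrame({
    "model_id": ["model_a", "model_b", "model_c"],
    "auc": [0.71, 0.74, 0.69],
})

# Higher AUC is better, so sort in descending order.
ranked = toy_leaderboard.sort_values("auc", ascending=False).reset_index(drop=True)

# The first row plays the role of the AutoML leader.
leader_id = ranked.loc[0, "model_id"]
print(ranked)
print(leader_id)
```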

# <span style="background-color:#d4fad6;color:black;padding:10px;border-radius:40px;">🚀 Best Model</span>

In [None]:
best_model = aml.leader
best_model

<div style="background-color:white;color:black;padding:10px;border:5px solid #53c95b;border-radius:20px;">

**What It Does:**
The line `best_model = aml.leader` assigns the best model from the H2O AutoML process to the variable `best_model`. The leader is the model ranked first on the leaderboard's sort metric, which defaults to AUC for a binary classification task such as this one.

**Selects the Best Model:**
The AutoML leader is the top-ranked model on the leaderboard, meaning it's the one that performed the best based on the evaluation metric used during training.

**Assigns the Model:**
By assigning aml.leader to best_model, you're saving the best model object in the best_model variable for further use, like making predictions or evaluating its performance.</div>

# <span style="background-color:#d4fad6;color:black;padding:10px;border-radius:40px;">🛡️ Make Predictions</span>

In [None]:
test_data = h2o.H2OFrame(test)

<div style="background-color:white;color:black;padding:10px;border:5px solid #53c95b;border-radius:20px;">
    
**What It Does:**
The line `test_data = h2o.H2OFrame(test)` converts the test dataset, a Pandas DataFrame, into an H2OFrame, the data format that H2O models require for making predictions.

**Makes the Data Usable by H2O:**
Once converted, test_data (an H2OFrame) can be passed into H2O models (such as best_model from H2OAutoML) for prediction and evaluation.
</div>

In [None]:
test_data['donor_age'] = test_data['donor_age'].asnumeric()
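
The notebook does not say why `donor_age` needs `asnumeric()`. One plausible reason (an assumption) is that H2O inferred a non-numeric type for this column in the test frame, which would then disagree with the numeric type seen during training. A Pandas-side alternative is to coerce the column before creating the H2OFrame, as sketched here on invented values.

```python
import pandas as pd

# Invented values: numbers stored as text, plus one missing entry.
toy_test = pd.DataFrame({"donor_age": ["31.5", "44", None]})

# errors="coerce" turns anything that cannot be parsed into a missing value.
toy_test["donor_age"] = pd.to_numeric(toy_test["donor_age"], errors="coerce")

print(toy_test["donor_age"].dtype)
```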

In [None]:
predictions = best_model.predict(test_data)

<div style="background-color:white;color:black;padding:10px;border:5px solid #53c95b;border-radius:20px;">

**What It Does:**
The line `predictions = best_model.predict(test_data)` uses the best model from H2O AutoML to make predictions on the test_data. Here’s a detailed explanation of what it does:

**Uses the Best Model:**
The best_model (retrieved via best_model = aml.leader) is the model that performed the best during the AutoML process. This model is now used to predict the target variable for new or unseen data (here, test_data).

**Makes Predictions:**
`best_model.predict(test_data)` performs predictions on `test_data`, which must be an H2OFrame. The model uses the features in `test_data` to predict the target variable (in this case, `efs`).

**Stores Predictions:**
The predictions are stored in the `predictions` variable, which is an H2OFrame. Each row in `predictions` corresponds to the respective row in `test_data`. For a binary classification model, the frame holds the predicted class in the `predict` column and the class probabilities in `p0` and `p1`.</div>
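
For a binary classifier, H2O's prediction frame has three columns: `predict` (the class label), `p0`, and `p1` (the class probabilities). The toy frame below mimics that layout with invented numbers to show how a continuous score can be read from `p1` rather than from the hard label.

```python
import pandas as pd

# Invented values laid out like an H2O binary-classification prediction frame.
toy_predictions = pd.DataFrame({
    "predict": [1, 0, 1],
    "p0": [0.30, 0.80, 0.45],
    "p1": [0.70, 0.20, 0.55],
})

# The two class probabilities in each row sum to 1.
row_sums = toy_predictions["p0"] + toy_predictions["p1"]

# p1 is the predicted probability of class 1 and keeps the ranking information
# that the 0/1 label throws away.
risk_scores = toy_predictions["p1"]
print(risk_scores.tolist())
```

Whether the label or the probability is the better submission value depends on the competition metric; a ranking-based metric generally benefits from the continuous score.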

In [None]:
predictions_df = predictions.as_data_frame()

<div style="background-color:white;color:black;padding:10px;border:5px solid #53c95b;border-radius:20px;">

**What It Does:**
The line `predictions_df = predictions.as_data_frame()` converts the H2OFrame containing the predictions into a Pandas DataFrame. This is useful for easier manipulation and analysis of the prediction results in a format commonly used in data science and analytics.

**Converts H2OFrame to Pandas DataFrame:**
The predictions variable holds the output from the model's prediction, which is in the H2OFrame format. Calling .as_data_frame() on it converts this H2OFrame into a Pandas DataFrame.</div>

# <span style="background-color:#d4fad6;color:black;padding:10px;border-radius:40px;">📁 Submit Results</span>

In [None]:
# Save the predicted values in the 'prediction' column
sub['prediction'] = predictions_df['predict'].values

# Save the results as a CSV file for submission
sub.to_csv('submission.csv', index=False)
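
Assigning `.values` ignores the index, so a row-count mismatch between the predictions and the submission template would only surface as an error at assignment time. The sketch below makes that check explicit. `build_submission` is a hypothetical helper, the `ID` and `prediction` column names are assumed to match the sample submission file, and the ids and scores are placeholders.

```python
import pandas as pd


def build_submission(ids, scores):
    """Pair each ID with its score, refusing inputs of different lengths."""
    ids, scores = list(ids), list(scores)
    if len(ids) != len(scores):
        raise ValueError(f"{len(ids)} ids but {len(scores)} scores")
    return pd.DataFrame({"ID": ids, "prediction": scores})


# Placeholder ids and scores for illustration only.
toy_submission = build_submission([0, 1, 2], [0.70, 0.20, 0.55])

# index=False keeps the row index out of the CSV, as in the cell above.
csv_text = toy_submission.to_csv(index=False)
print(csv_text)
```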

In [None]:
sub.head()