

<a id="1"></a>
# <div style="text-align:center; border-radius:15px 50px; padding:7px; color:white; margin:0; font-size:110%; font-family:Pacifico; background-color:#0f7a7a; overflow:hidden"><b>AutoML</b></div>

![image.png](attachment:994aab67-d3fd-4ae5-805a-b6101334df08.png)

**Image credits:** https://miro.medium.com/v2/resize:fit:1100/format:webp/1*aqonxCij9ZyOnxNWpkkKvA.jpeg


AutoML, short for Automated Machine Learning, automates the end-to-end process of applying machine learning to real-world problems. The goal is to automate as many steps of the machine learning pipeline as possible, from data preprocessing to model deployment, and so reduce the manual effort required. It does this with algorithms and techniques for automatic feature engineering, model selection, hyperparameter optimization, and model deployment.

Key aspects of AutoML include:

1. **Data Preparation**: Handling data ingestion, cleaning, transformation, and feature engineering to prepare the data for modeling.

2. **Model Selection**: Automatically selecting the best-performing machine learning models based on the dataset and problem type.

3. **Hyperparameter Optimization**: Automatically tuning the hyperparameters of selected models to optimize their performance.

4. **Model Evaluation**: Evaluating and comparing models using metrics relevant to the problem at hand.

5. **Ensemble Methods**: Combining predictions from multiple models (ensembles) to improve overall predictive accuracy.

6. **Deployment**: Facilitating the deployment of trained models into production environments.
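
To make the model selection, tuning, and evaluation steps concrete, here is a minimal library-free sketch of the loop an AutoML tool runs. Everything in it is made up for illustration (the toy data, the `threshold_model` family, the candidate grid); real libraries search far larger spaces and use cross-validation rather than a single split.

```python
import random

def make_data(n, rng):
    """Toy 1-D binary data: label is 1 when x exceeds 0.6, with 10% label noise."""
    rows = []
    for _ in range(n):
        x = rng.random()
        label = int(x > 0.6)
        if rng.random() < 0.1:
            label = 1 - label
        rows.append((x, label))
    return rows

def threshold_model(threshold):
    """Hypothetical model family: predict 1 when x is above a tunable threshold."""
    return lambda x: int(x > threshold)

def accuracy(model, rows):
    return sum(model(x) == label for x, label in rows) / len(rows)

rng = random.Random(0)
train, valid = make_data(300, rng), make_data(100, rng)

# Search space: a constant baseline plus a grid over the threshold hyperparameter.
candidates = {"always_0": lambda x: 0}
for t in (0.2, 0.4, 0.6, 0.8):
    candidates[f"threshold_{t}"] = threshold_model(t)

# Selection: score every candidate on the training split and rank them,
# then check the winner on the validation split it has not seen.
leaderboard = sorted(((accuracy(m, train), name) for name, m in candidates.items()), reverse=True)
best_score, best_name = leaderboard[0]
print(best_name, round(best_score, 3), round(accuracy(candidates[best_name], valid), 3))
```

The libraries covered in this notebook follow the same pattern: define candidates, score them, rank them, and report the best.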

### This notebook covers the following AutoML libraries
1. Lazypredict
2. TPOT
3. H2O AutoML
4. H2O with sklearn

### In a future version I plan to include
1. AutoSklearn
2. AutoViML
3. PyCaret
4. EvalML
5. Some AutoEDA libraries too

#### I tried to include all of these in the first version, but I wasn't able to install or load them. I will post an update once I find a solution.



<div style="padding:10px;color:white;margin:0;font-size:200%;text-align:center;display:fill;border-radius:10px;background-color:#215f95;;overflow:hidden;font-weight:501;font-family:magra">Will AutoML replace data scientists? NO</div>

### Here are several reasons why AutoML is not expected to replace data scientists:

1. **Domain Expertise**: Data scientists bring domain knowledge and understanding of specific business problems, which is crucial for framing machine learning tasks, interpreting results, and translating insights into actionable strategies. AutoML tools can automate technical aspects but still require human expertise to contextualize and apply results effectively.

2. **Complex Problem Solving**: Many real-world problems require creative problem-solving and innovative approaches that go beyond standard machine learning techniques. Data scientists are essential for designing custom solutions, adapting algorithms to unique challenges, and experimenting with new methodologies.

3. **Data Understanding and Preparation**: While AutoML can handle some aspects of data preprocessing, such as missing value imputation and feature engineering, understanding the nuances of data, identifying biases, and curating datasets for specific tasks often require human judgment and domain expertise.

4. **Interpretability and Explainability**: Data scientists are responsible for interpreting model outputs, understanding their limitations, and ensuring models are explainable and fair. AutoML may produce complex models that are harder to interpret without human intervention.

5. **Iterative Improvement**: Developing effective machine learning solutions often involves iterative improvement cycles, where data scientists analyze model performance, make adjustments, and validate results. AutoML can accelerate these cycles but typically requires human oversight to ensure continuous improvement and adaptability.

6. **Problem Framing and Solution Design**: Data scientists play a crucial role in framing machine learning problems in ways that align with business objectives, identifying appropriate metrics for evaluation, and integrating machine learning solutions into broader organizational workflows.

In [1]:
# We use a GPU to speed things up; it is advised for H2O AutoML because its XGBoost integration supports GPUs
!nvidia-smi

Thu Jul 18 19:34:29 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   40C    P8             11W /   70W |       1MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  Tesla T4                       Off |   00



<a id="1"></a>
# <div style="text-align:center; border-radius:15px 50px; padding:7px; color:white; margin:0; font-size:110%; font-family:Pacifico; background-color:#0f7a7a; overflow:hidden"><b>Installing all important libraries</b></div>

In [2]:
!pip install tpot



In [3]:
!pip install h2o



In [4]:
!pip install lazypredict

Collecting lazypredict
  Downloading lazypredict-0.2.12-py2.py3-none-any.whl.metadata (12 kB)
Downloading lazypredict-0.2.12-py2.py3-none-any.whl (12 kB)
Installing collected packages: lazypredict
Successfully installed lazypredict-0.2.12




<a id="1"></a>
# <div style="text-align:center; border-radius:15px 50px; padding:7px; color:white; margin:0; font-size:110%; font-family:Pacifico; background-color:#0f7a7a; overflow:hidden"><b>LazyPredict</b></div>

LazyPredict is an AutoML library designed to simplify building and evaluating machine learning models. It automates the creation and comparison of multiple models with minimal code, which makes it an excellent tool for quickly assessing various algorithms on a given dataset. Here are some key features and benefits of LazyPredict:

### Key Features

1. **Model Selection and Comparison:**
   - LazyPredict creates and evaluates multiple machine learning models without requiring detailed parameter tuning or configuration. This allows users to quickly compare the performance of different algorithms.

2. **Ease of Use:**
   - With a simple and intuitive API, LazyPredict makes it easy to use for both beginners and experienced data scientists. It requires minimal setup and can be integrated into existing workflows seamlessly.

3. **Comprehensive Model Evaluation:**
   - The library evaluates a wide range of models from scikit-learn, including linear models, ensemble methods, and tree-based algorithms. It provides metrics such as accuracy, F1 score, and more to compare model performance.

4. **Time-Saving:**
   - By automating the model training and evaluation process, LazyPredict saves significant time, especially in the initial stages of model development and selection.
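
The metric columns in a LazyPredict results table can be reproduced by hand. The sketch below computes accuracy, balanced accuracy, and F1 for binary labels from a made-up set of predictions. It assumes both classes occur in the labels and in the predictions, and LazyPredict's own F1 column may use a different averaging scheme than this plain binary F1.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, balanced accuracy and F1 for binary labels (1 = positive class)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    recall = tp / (tp + fn)        # share of actual positives that were found
    specificity = tn / (tn + fp)   # share of actual negatives that were found
    precision = tp / (tp + fp)     # share of predicted positives that are correct
    return {
        "accuracy": (tp + tn) / len(pairs),
        "balanced_accuracy": (recall + specificity) / 2,
        "f1": 2 * precision * recall / (precision + recall),
    }

# Made-up labels and predictions, only to exercise the function.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
metrics = classification_metrics(y_true, y_pred)
print({name: round(value, 3) for name, value in metrics.items()})
```

Balanced accuracy averages the per-class recall, so it stays informative when one class is much larger than the other.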


### Limitations

- **Not for Production:**
  While LazyPredict is great for initial exploration, it is not intended for production use. Once a model is selected, further tuning and validation are necessary.

- **Limited Customization:**
  The library abstracts much of the model training process, which limits the ability to customize and fine-tune models compared to manual coding.


In [5]:
import lazypredict
from lazypredict.Supervised import LazyClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

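# Load the breast cancer dataset and hold out 30% of it for testing
data = load_breast_cancer()
X = data.data
y = data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=123)

# Fit every available classifier with default settings and collect the scores
clf = LazyClassifier(verbose=0, ignore_warnings=True, custom_metric=None)
models, predictions = clf.fit(X_train, X_test, y_train, y_test)

print(models)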

100%|██████████| 29/29 [00:01<00:00, 14.81it/s]

[LightGBM] [Info] Number of positive: 254, number of negative: 144
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000263 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 3976
[LightGBM] [Info] Number of data points in the train set: 398, number of used features: 30
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.638191 -> initscore=0.567521
[LightGBM] [Info] Start training from score 0.567521
                               Accuracy  Balanced Accuracy  ROC AUC  F1 Score  \
Model                                                                           
LogisticRegression                 0.99               0.99     0.99      0.99   
SGDClassifier                      0.99               0.99     0.99      0.99   
Perceptron                         0.99               0.99     0.99      0.99   
LinearSVC                          0.99               0.99     0.99      0.99   
ExtraTreesClassifier        




In [6]:
models

Model,Accuracy,Balanced Accuracy,ROC AUC,F1 Score,Time Taken
LogisticRegression,0.99,0.99,0.99,0.99,0.06
SGDClassifier,0.99,0.99,0.99,0.99,0.03
Perceptron,0.99,0.99,0.99,0.99,0.01
LinearSVC,0.99,0.99,0.99,0.99,0.05
ExtraTreesClassifier,0.98,0.98,0.98,0.98,0.2
SVC,0.98,0.98,0.98,0.98,0.03
RandomForestClassifier,0.98,0.98,0.98,0.98,0.29
RidgeClassifier,0.98,0.98,0.98,0.98,0.03
QuadraticDiscriminantAnalysis,0.98,0.98,0.98,0.98,0.02
AdaBoostClassifier,0.98,0.98,0.98,0.98,0.2


In [7]:
from lazypredict.Supervised import LazyRegressor
from sklearn import datasets
from sklearn.datasets import fetch_california_housing
from sklearn.utils import shuffle
import numpy as np

# Shuffle the California housing data so the 90/10 split below is random
housing = fetch_california_housing()
X, y = shuffle(housing.data, housing.target, random_state=13)

# First 90% of the shuffled rows for training, the rest for testing
offset = int(X.shape[0] * 0.9)
X_train, y_train = X[:offset], y[:offset]
X_test, y_test = X[offset:], y[offset:]

reg = LazyRegressor(verbose=0, ignore_warnings=False, custom_metric=None)
models, predictions = reg.fit(X_train, X_test, y_train, y_test)

print(models)

 76%|███████▌  | 32/42 [02:36<00:28,  2.81s/it]

QuantileRegressor model failed to execute
Solver interior-point is not anymore available in SciPy >= 1.11.0.


 98%|█████████▊| 41/42 [03:03<00:02,  2.55s/it]

[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.001506 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1838
[LightGBM] [Info] Number of data points in the train set: 18576, number of used features: 8
[LightGBM] [Info] Start training from score 2.063611


100%|██████████| 42/42 [03:03<00:00,  4.38s/it]

                               Adjusted R-Squared  R-Squared  RMSE  Time Taken
Model                                                                         
HistGradientBoostingRegressor                0.84       0.84  0.47        0.60
XGBRegressor                                 0.84       0.84  0.48        0.35
LGBMRegressor                                0.83       0.83  0.48        0.24
ExtraTreesRegressor                          0.82       0.82  0.50        4.31
RandomForestRegressor                        0.82       0.82  0.50       12.81
BaggingRegressor                             0.80       0.80  0.53        1.33
GradientBoostingRegressor                    0.78       0.78  0.55        4.14
NuSVR                                        0.75       0.75  0.59       17.69
SVR                                          0.74       0.74  0.60       13.24
MLPRegressor                                 0.71       0.71  0.63       17.94
KNeighborsRegressor                          0.70   




In [8]:
models

Model,Adjusted R-Squared,R-Squared,RMSE,Time Taken
HistGradientBoostingRegressor,0.84,0.84,0.47,0.6
XGBRegressor,0.84,0.84,0.48,0.35
LGBMRegressor,0.83,0.83,0.48,0.24
ExtraTreesRegressor,0.82,0.82,0.5,4.31
RandomForestRegressor,0.82,0.82,0.5,12.81
BaggingRegressor,0.8,0.8,0.53,1.33
GradientBoostingRegressor,0.78,0.78,0.55,4.14
NuSVR,0.75,0.75,0.59,17.69
SVR,0.74,0.74,0.6,13.24
MLPRegressor,0.71,0.71,0.63,17.94




<a id="1"></a>
# <div style="text-align:center; border-radius:15px 50px; padding:7px; color:white; margin:0; font-size:110%; font-family:Pacifico; background-color:#0f7a7a; overflow:hidden"><b>TPOT</b></div>

TPOT (Tree-based Pipeline Optimization Tool) is an automated machine learning (AutoML) library in Python that uses genetic programming to optimize machine learning pipelines. It aims to automate the process of selecting the best model and hyperparameters for a given dataset, making it easier to create high-performing machine learning models without extensive manual tuning.

Here are some key features and concepts of TPOT:

1. **Genetic Programming:**
   TPOT uses genetic programming to explore and optimize machine learning pipelines. Genetic programming is an evolutionary algorithm-based methodology inspired by biological evolution to find approximate solutions to optimization and search problems.

2. **Pipeline Optimization:**
   TPOT automatically explores thousands of possible pipelines, which consist of data preprocessing steps, feature selection methods, and machine learning models, to find the best one for the dataset.

3. **Model and Hyperparameter Selection:**
   TPOT not only selects the best model but also tunes its hyperparameters, providing a complete solution for creating high-performing machine learning models.

4. **Scikit-learn Integration:**
   TPOT is built on top of Scikit-learn, a popular machine learning library in Python. It leverages Scikit-learn’s models and tools to create and optimize pipelines.

5. **Parallel Processing:**
   TPOT supports parallel processing, allowing it to speed up the optimization process by utilizing multiple CPU cores.

6. **Human-readable Pipelines:**
   One of the significant advantages of TPOT is that it generates human-readable code for the optimized pipeline, making it easy to understand, modify, and reproduce the results.
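
The evolutionary idea behind these features can be sketched in a few lines. This is a deliberately reduced toy, not TPOT's algorithm: pipelines here are flat triples rather than trees, the search space is hypothetical, and `fitness` is a made-up stand-in for the cross-validation score a real run would compute.

```python
import random

# Hypothetical search space: a "pipeline" is a (scaler, n_estimators, max_depth) triple.
SCALERS = ["none", "standard", "robust"]
N_ESTIMATORS = [10, 50, 100, 200]
MAX_DEPTH = [2, 4, 8, 16]
GENE_POOLS = [SCALERS, N_ESTIMATORS, MAX_DEPTH]

def random_pipeline(rng):
    return tuple(rng.choice(pool) for pool in GENE_POOLS)

def fitness(pipeline):
    """Made-up score; a real run would train the pipeline and cross-validate it."""
    scaler, n_estimators, max_depth = pipeline
    score = 0.70
    score += 0.05 if scaler == "robust" else 0.0
    score += 0.04 * N_ESTIMATORS.index(n_estimators) / 3
    score -= 0.02 * abs(MAX_DEPTH.index(max_depth) - 2)
    return score

def mutate(pipeline, rng):
    """Replace one randomly chosen component with a fresh random choice."""
    genes = list(pipeline)
    position = rng.randrange(len(genes))
    genes[position] = rng.choice(GENE_POOLS[position])
    return tuple(genes)

def crossover(a, b, rng):
    """Build a child by taking each component from one of the two parents."""
    return tuple(rng.choice(pair) for pair in zip(a, b))

def evolve(generations, population_size, seed):
    rng = random.Random(seed)
    population = [random_pipeline(rng) for _ in range(population_size)]
    for _ in range(generations):
        # Selection: the fitter half of the population become parents.
        parents = sorted(population, key=fitness, reverse=True)[: population_size // 2]
        offspring = [mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
                     for _ in range(population_size)]
        # Survival: keep the best individuals among parents and offspring.
        population = sorted(parents + offspring, key=fitness, reverse=True)[:population_size]
    return population[0]

best = evolve(generations=4, population_size=10, seed=0)
print(best, round(fitness(best), 3))
```

Because the parents survive into the next round, the best score never gets worse from one generation to the next.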

### Parameters

- `generations`: The number of iterations TPOT will run to optimize the pipeline.
- `population_size`: The number of pipelines in the population for each generation.
- `verbosity`: Controls the level of detail of the output during the optimization process.
- `random_state`: Ensures reproducibility of results by setting a seed for the random number generator.
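
These two parameters together set the cost of a run. TPOT evaluates the initial population plus one batch of offspring per generation, and the offspring batch size defaults to `population_size`. The helper below is my own illustration of that arithmetic (`total_pipelines` is not a TPOT function); its results match the totals shown in the progress bars of the two runs in this notebook.

```python
def total_pipelines(generations, population_size, offspring_size=None):
    """Pipelines evaluated in one run: initial population plus offspring for each generation."""
    if offspring_size is None:
        offspring_size = population_size  # default: as many offspring as the population size
    return population_size + generations * offspring_size

first_run = total_pipelines(generations=4, population_size=10)
second_run = total_pipelines(generations=2, population_size=20)
print(first_run, second_run)  # 50 60
```

Doubling either `generations` or `population_size` roughly doubles the runtime, so it helps to start small.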


### Limitations

1. **Computationally Intensive:**
   TPOT’s optimization process can be computationally expensive, especially for large datasets and complex pipelines.

2. **Black-box Nature:**
   The optimization process is largely a black box, which might be less suitable for users who need fine-grained control over model selection and tuning.

3. **Limited Customization:**
   While TPOT automates many aspects of model selection and tuning, it might not provide the same level of customization and flexibility as manual model development.


![image.png](attachment:19aaba62-c1b2-4271-930e-980a176c31e7.png)

In [9]:
from tpot import TPOTClassifier
from sklearn.model_selection import train_test_split
import sys, tempfile, urllib, os
import pandas as pd
import numpy as np

In [10]:
churn_df=pd.read_csv('/kaggle/input/telco-customer-churn/WA_Fn-UseC_-Telco-Customer-Churn.csv')
churn_df.head()

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,...,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,...,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes


In [11]:
print("Rows     : ", churn_df.shape[0])
print("Columns  : ", churn_df.shape[1])

Rows     :  7043
Columns  :  21


In [12]:
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OrdinalEncoder

# Text-valued columns (including the target, Churn) to encode as ordinal numbers
categorical_columns = [
    'gender', 'Partner', 'Dependents', 'PhoneService', 'MultipleLines', 'InternetService',
    'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV',
    'StreamingMovies', 'Contract', 'PaperlessBilling', 'PaymentMethod', 'Churn',
]
column_trans = make_column_transformer((OrdinalEncoder(), categorical_columns))

# The result holds only the encoded columns, in the order listed above
churn_transformed = column_trans.fit_transform(churn_df)

In [13]:
# Wrap the encoded array in a DataFrame, then overwrite the matching columns of churn_df in place
churn_df_trans = pd.DataFrame(churn_transformed, columns=categorical_columns)
churn_df.update(churn_df_trans)
     

churn_df.head()

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,0.0,0,1.0,0.0,1,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,2.0,29.85,29.85,0.0
1,5575-GNVDE,1.0,0,0.0,0.0,34,1.0,0.0,0.0,2.0,...,2.0,0.0,0.0,0.0,1.0,0.0,3.0,56.95,1889.5,0.0
2,3668-QPYBK,1.0,0,0.0,0.0,2,1.0,0.0,0.0,2.0,...,0.0,0.0,0.0,0.0,0.0,1.0,3.0,53.85,108.15,1.0
3,7795-CFOCW,1.0,0,0.0,0.0,45,0.0,1.0,0.0,2.0,...,2.0,2.0,0.0,0.0,1.0,0.0,0.0,42.3,1840.75,0.0
4,9237-HQITU,0.0,0,0.0,0.0,2,1.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,2.0,70.7,151.65,1.0
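
The numbers in the encoded table come from a simple rule: with its default settings, `OrdinalEncoder` sorts the distinct values of each column and replaces each value with its position in that order. A library-free sketch of the rule follows; `fit_ordinal` is a hypothetical helper and the sample values are a hand-picked subset of the Telco categories.

```python
def fit_ordinal(values):
    """Map each distinct category to its position in sorted order."""
    return {category: float(code) for code, category in enumerate(sorted(set(values)))}

contract = ["Month-to-month", "One year", "Month-to-month", "Two year"]
security = ["No", "Yes", "No internet service", "Yes"]

contract_codes = fit_ordinal(contract)
security_codes = fit_ordinal(security)
encoded_security = [security_codes[value] for value in security]

print(contract_codes)
print(encoded_security)  # [0.0, 2.0, 1.0, 2.0]
```

This is why `Yes` shows up as 2.0 in columns such as OnlineSecurity. The codes follow alphabetical order and carry no real ranking, which matters less for tree-based models than for linear ones.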


In [14]:
churn_df.isna().sum()

customerID          0
gender              0
SeniorCitizen       0
Partner             0
Dependents          0
tenure              0
PhoneService        0
MultipleLines       0
InternetService     0
OnlineSecurity      0
OnlineBackup        0
DeviceProtection    0
TechSupport         0
StreamingTV         0
StreamingMovies     0
Contract            0
PaperlessBilling    0
PaymentMethod       0
MonthlyCharges      0
TotalCharges        0
Churn               0
dtype: int64

In [15]:
churn_df.replace(r'^\s*$', np.nan, regex=True).isna().sum()

customerID           0
gender               0
SeniorCitizen        0
Partner              0
Dependents           0
tenure               0
PhoneService         0
MultipleLines        0
InternetService      0
OnlineSecurity       0
OnlineBackup         0
DeviceProtection     0
TechSupport          0
StreamingTV          0
StreamingMovies      0
Contract             0
PaperlessBilling     0
PaymentMethod        0
MonthlyCharges       0
TotalCharges        11
Churn                0
dtype: int64

In [16]:
churn_df = churn_df.replace(r'^\s*$', np.nan, regex=True)
churn_df.isna().sum()

customerID           0
gender               0
SeniorCitizen        0
Partner              0
Dependents           0
tenure               0
PhoneService         0
MultipleLines        0
InternetService      0
OnlineSecurity       0
OnlineBackup         0
DeviceProtection     0
TechSupport          0
StreamingTV          0
StreamingMovies      0
Contract             0
PaperlessBilling     0
PaymentMethod        0
MonthlyCharges       0
TotalCharges        11
Churn                0
dtype: int64

In [17]:
churn_df.dtypes

customerID           object
gender              float64
SeniorCitizen         int64
Partner             float64
Dependents          float64
tenure                int64
PhoneService        float64
MultipleLines       float64
InternetService     float64
OnlineSecurity      float64
OnlineBackup        float64
DeviceProtection    float64
TechSupport         float64
StreamingTV         float64
StreamingMovies     float64
Contract            float64
PaperlessBilling    float64
PaymentMethod       float64
MonthlyCharges      float64
TotalCharges         object
Churn               float64
dtype: object

In [18]:
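# Convert TotalCharges (column 19) to numeric; anything that cannot be parsed becomes NaN.
# Assigning by column name replaces the column, so the dtype changes from object to float.
churn_df["TotalCharges"] = pd.to_numeric(churn_df["TotalCharges"], errors='coerce')
from sklearn.impute import SimpleImputer
imp_median = SimpleImputer(missing_values=np.nan, strategy='median')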

In [19]:
# Fill the missing TotalCharges values with the column median
churn_df["TotalCharges"] = imp_median.fit_transform(churn_df[["TotalCharges"]]).ravel()
churn_df.isna().sum()

customerID          0
gender              0
SeniorCitizen       0
Partner             0
Dependents          0
tenure              0
PhoneService        0
MultipleLines       0
InternetService     0
OnlineSecurity      0
OnlineBackup        0
DeviceProtection    0
TechSupport         0
StreamingTV         0
StreamingMovies     0
Contract            0
PaperlessBilling    0
PaymentMethod       0
MonthlyCharges      0
TotalCharges        0
Churn               0
dtype: int64
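
The cleaning steps above (blank strings to missing, text to numbers, median imputation) can be traced on a tiny example without pandas or scikit-learn. The sample values are made up, and the helper names are my own.

```python
import math
import re
import statistics

BLANK = re.compile(r"^\s*$")  # same pattern as used on churn_df above

def to_float_or_nan(text):
    """Blank or non-numeric strings become NaN, similar to errors='coerce'."""
    if BLANK.match(text):
        return math.nan
    try:
        return float(text)
    except ValueError:
        return math.nan

def impute_median(values):
    """Replace every NaN with the median of the observed values."""
    median = statistics.median(v for v in values if not math.isnan(v))
    return [median if math.isnan(v) else v for v in values]

raw = ["29.85", "1889.5", " ", "108.15", "", "1840.75"]  # made-up sample
numeric = [to_float_or_nan(text) for text in raw]
filled = impute_median(numeric)
print(filled)
```

The median is less sensitive to a few very large charges than the mean, which is a common reason to prefer it for skewed columns like this one.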

In [20]:
churn_df_X = churn_df.drop("Churn", axis=1)
churn_df_X = churn_df_X.drop("customerID", axis=1)
churn_df_y = churn_df['Churn']
X_train, X_test, y_train, y_test = train_test_split(churn_df_X, churn_df_y, train_size=0.75, test_size=0.25)

In [21]:
tpot =  TPOTClassifier(generations=4, population_size=10, verbosity=3)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))

32 operators have been imported by TPOT.


Version 0.12.1 of tpot is outdated. Version 0.12.2 was released Friday February 23, 2024.


Optimization Progress:   0%|          | 0/50 [00:00<?, ?pipeline/s]

_pre_test decorator: _mate_operator: num_test=0 'str' object has no attribute 'arity'.
_pre_test decorator: _random_mutation_operator: num_test=0 Unsupported set of arguments: The combination of penalty='l1' and loss='squared_hinge' are not supported when dual=True, Parameters: penalty='l1', loss='squared_hinge', dual=True.

Generation 1 - Current Pareto front scores:

-1	0.8057527737163499	RandomForestClassifier(input_matrix, RandomForestClassifier__bootstrap=True, RandomForestClassifier__criterion=gini, RandomForestClassifier__max_features=0.45, RandomForestClassifier__min_samples_leaf=14, RandomForestClassifier__min_samples_split=17, RandomForestClassifier__n_estimators=100)
_pre_test decorator: _mate_operator: num_test=0 'str' object has no attribute 'arity'.
_pre_test decorator: _mate_operator: num_test=0 'str' object has no attribute 'arity'.
_pre_test decorator: _mate_operator: num_test=0 'str' object has no attribute 'arity'.

Generation 2 - Current Pareto front scores:

-1	0.8

## Use `n_jobs` to evaluate pipelines in parallel during training

In [22]:
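import multiprocessing

if __name__ == '__main__':
    # 'forkserver' helps avoid freezes when TPOT runs with n_jobs > 1
    multiprocessing.set_start_method('forkserver', force=True)
    # n_jobs sets the number of worker processes used to evaluate pipelines
    tpot = TPOTClassifier(generations=2, population_size=20, verbosity=2, n_jobs=20, random_state=50)
    tpot.fit(X_train, y_train)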

Version 0.12.1 of tpot is outdated. Version 0.12.2 was released Friday February 23, 2024.


Optimization Progress:   0%|          | 0/60 [00:00<?, ?pipeline/s]


Generation 1 - Current best internal CV score: 0.8061324574983516

Generation 2 - Current best internal CV score: 0.8063213138958172

Best pipeline: ExtraTreesClassifier(RobustScaler(input_matrix), bootstrap=True, criterion=entropy, max_features=0.6000000000000001, min_samples_leaf=13, min_samples_split=14, n_estimators=100)


In [23]:
print(tpot.score(X_test, y_test))

0.8023850085178875


## Export the best pipeline as scikit-learn code that can be pasted into a notebook for further use

In [24]:
tpot.export('tpot_churn_pipeline.py')
!cat tpot_churn_pipeline.py

import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from tpot.export_utils import set_param_recursive

# NOTE: Make sure that the outcome column is labeled 'target' in the data file
tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR', dtype=np.float64)
features = tpot_data.drop('target', axis=1)
training_features, testing_features, training_target, testing_target = \
            train_test_split(features, tpot_data['target'], random_state=50)

# Average CV score on the training set was: 0.8063213138958172
exported_pipeline = make_pipeline(
    RobustScaler(),
    ExtraTreesClassifier(bootstrap=True, criterion="entropy", max_features=0.6000000000000001, min_samples_leaf=13, min_samples_split=14, n_estimators=100)
)
# Fix random state for all the steps in exported pipeline
set_param_recursive(ex

In [25]:
tpot.evaluated_individuals_

{'ExtraTreesClassifier(input_matrix, ExtraTreesClassifier__bootstrap=False, ExtraTreesClassifier__criterion=entropy, ExtraTreesClassifier__max_features=0.1, ExtraTreesClassifier__min_samples_leaf=5, ExtraTreesClassifier__min_samples_split=8, ExtraTreesClassifier__n_estimators=100)': {'generation': 0,
  'mutation_count': 0,
  'crossover_count': 0,
  'predecessor': ('ROOT',),
  'operator_count': 1,
  'internal_cv_score': 0.7902292795504716},
 'KNeighborsClassifier(input_matrix, KNeighborsClassifier__n_neighbors=29, KNeighborsClassifier__p=2, KNeighborsClassifier__weights=uniform)': {'generation': 0,
  'mutation_count': 0,
  'crossover_count': 0,
  'predecessor': ('ROOT',),
  'operator_count': 1,
  'internal_cv_score': 0.7807606576646312},
 'BernoulliNB(GradientBoostingClassifier(input_matrix, GradientBoostingClassifier__learning_rate=1.0, GradientBoostingClassifier__max_depth=3, GradientBoostingClassifier__max_features=0.8, GradientBoostingClassifier__min_samples_leaf=11, GradientBoostin

In [26]:
tpot.fitted_pipeline_



<a id="1"></a>
# <div style="text-align:center; border-radius:15px 50px; padding:7px; color:white; margin:0; font-size:110%; font-family:Pacifico; background-color:#0f7a7a; overflow:hidden"><b>H2O AutoML</b></div>

H2O AutoML is part of H2O, a platform of distributed machine learning algorithms implemented in Java and usable from Python, R, Scala, and Spark. It can be deployed on platforms such as Spark clusters and AWS, and it includes a web GUI that communicates with the H2O server using JSON.

The primary benefit of H2O AutoML lies in its automation of data processing, model training, tuning, and ensemble construction. This automation allows developers to focus on tasks like data collection, feature engineering, and model deployment.

Key functionalities of H2O AutoML include:
- Providing essential data processing capabilities integrated into all H2O algorithms.
- Training a grid of algorithms such as Gradient Boosting Machines (GBMs), Deep Neural Networks (DNNs), and Generalized Linear Models (GLMs) using carefully selected hyperparameter settings.
- Tuning individual models through cross-validation techniques.
- Building two types of Stacked Ensembles: one optimized for model performance and another optimized for production use.
- Generating a sorted "Leaderboard" that ranks all models based on their performance.
- Offering easy export capabilities for deploying models into production environments.
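
To show how an ensemble and a leaderboard fit together, here is a reduced sketch. It is not H2O's Stacked Ensemble: instead of training a metalearner on cross-validated predictions, it simply picks the best blend weight for two hypothetical base models on made-up held-out data, then ranks everything by error.

```python
def mse(predictions, targets):
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

# Made-up held-out targets and predictions from two hypothetical base models.
targets = [3.0, 5.0, 7.0, 9.0, 11.0]
base_predictions = {
    "model_a": [2.0, 4.5, 7.5, 8.0, 12.0],
    "model_b": [4.0, 5.5, 6.0, 10.0, 10.5],
}

def blend(weight):
    """Weighted average of the two base models' predictions."""
    a, b = base_predictions["model_a"], base_predictions["model_b"]
    return [weight * pa + (1 - weight) * pb for pa, pb in zip(a, b)]

# Stand-in for a metalearner: try weights 0.0, 0.1, ..., 1.0 and keep the best.
best_weight = min((w / 10 for w in range(11)), key=lambda w: mse(blend(w), targets))
base_predictions["blend"] = blend(best_weight)

# Leaderboard: every model ranked by error, best (lowest MSE) first.
leaderboard = sorted((mse(preds, targets), name) for name, preds in base_predictions.items())
for error, name in leaderboard:
    print(f"{name:8s} {error:.3f}")
```

The blend can never rank below both base models here, because weights 0 and 1 reproduce them exactly. That is the intuition for why ensembles tend to sit at the top of a leaderboard.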

In [27]:
# Java runtime is required because H2O AutoML is written in Java
!apt-get install default-jre
!java -version

Reading package lists... Done
Building dependency tree       
Reading state information... Done
default-jre is already the newest version (2:1.11-72).
0 upgraded, 0 newly installed, 0 to remove and 75 not upgraded.
openjdk version "11.0.22" 2024-01-16
OpenJDK Runtime Environment (build 11.0.22+7-post-Ubuntu-0ubuntu220.04.1)
OpenJDK 64-Bit Server VM (build 11.0.22+7-post-Ubuntu-0ubuntu220.04.1, mixed mode, sharing)


### H2O AutoML handles data preprocessing by itself, including categorical and numerical columns and missing-value imputation. It also takes care of model selection, presents a leaderboard of models trained with different hyperparameters, produces deployment-ready code, and supports GPUs through XGBoost.

In [28]:
import h2o
h2o.init()

Checking whether there is an H2O instance running at http://localhost:54321..... not found.
Attempting to start a local H2O server...
  Java Version: openjdk version "11.0.22" 2024-01-16; OpenJDK Runtime Environment (build 11.0.22+7-post-Ubuntu-0ubuntu220.04.1); OpenJDK 64-Bit Server VM (build 11.0.22+7-post-Ubuntu-0ubuntu220.04.1, mixed mode, sharing)
  Starting server from /opt/conda/lib/python3.10/site-packages/h2o/backend/bin/h2o.jar
  Ice root: /tmp/tmpmad184sf
  JVM stdout: /tmp/tmpmad184sf/h2o_unknownUser_started_from_python.out
  JVM stderr: /tmp/tmpmad184sf/h2o_unknownUser_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321 ... successful.


0,1
H2O_cluster_uptime:,02 secs
H2O_cluster_timezone:,Etc/UTC
H2O_data_parsing_timezone:,UTC
H2O_cluster_version:,3.46.0.2
H2O_cluster_version_age:,2 months and 5 days
H2O_cluster_name:,H2O_from_python_unknownUser_l2lfaf
H2O_cluster_total_nodes:,1
H2O_cluster_free_memory:,7.250 Gb
H2O_cluster_total_cores:,4
H2O_cluster_allowed_cores:,4


In [29]:
from h2o.automl import H2OAutoML
churn_df = h2o.import_file('/kaggle/input/titanic/train.csv')
churn_df.describe()

Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
type,int,int,int,string,enum,real,int,int,int,real,enum,enum
mins,1.0,0.0,1.0,,,0.42,0.0,0.0,693.0,0.0,,
mean,446.0,0.3838383838383838,2.3086419753086447,,,29.69911764705884,0.5230078563411893,0.3815937149270483,260318.5491679275,32.20420796857465,,
maxs,891.0,1.0,3.0,,,80.0,8.0,6.0,3101298.0,512.3292,,
sigma,257.3538420152301,0.4865924542648575,0.8360712409770491,,,14.526497332334035,1.1027434322934315,0.8060572211299488,471609.26868834975,49.69342859718089,,
zeros,0,549,0,0,,0,608,678,0,15,,
missing,0,0,0,0,0,177,0,0,230,0,687,2
0,1.0,0.0,3.0,"Braund, Mr. Owen Harris",male,22.0,1.0,0.0,,7.25,,S
1,2.0,1.0,1.0,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38.0,1.0,0.0,,71.2833,C85,C
2,3.0,1.0,3.0,"Heikkinen, Miss. Laina",female,26.0,0.0,0.0,,7.925,,S


In [30]:
churn_df.types

{'PassengerId': 'int',
 'Survived': 'int',
 'Pclass': 'int',
 'Name': 'string',
 'Sex': 'enum',
 'Age': 'real',
 'SibSp': 'int',
 'Parch': 'int',
 'Ticket': 'int',
 'Fare': 'real',
 'Cabin': 'enum',
 'Embarked': 'enum'}

In [31]:
#70:15:15 split
churn_train,churn_test,churn_valid = churn_df.split_frame(ratios=[.7, .15])
churn_train

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,,7.25,,S
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,,7.925,,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450.0,8.05,,S
6,0,3,"Moran, Mr. James",male,,0,0,330877.0,8.4583,,Q
8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909.0,21.075,,S
9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742.0,11.1333,,S
10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736.0,30.0708,,C
11,1,3,"Sandstrom, Miss. Marguerite Rut",female,4.0,1,1,,16.7,G6,S
12,1,1,"Bonnell, Miss. Elizabeth",female,58.0,0,0,113783.0,26.55,C103,S
13,0,3,"Saundercock, Mr. William Henry",male,20.0,0,0,,8.05,,S
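
With `ratios=[.7, .15]` the frame is cut into three parts: about 70%, about 15%, and the remainder. As far as I understand, H2O's `split_frame` assigns rows probabilistically, so the proportions are approximate rather than exact. The sketch below illustrates that idea with a hypothetical `split_rows` helper; it is not H2O's implementation.

```python
import random

def split_rows(rows, ratios, seed):
    """Assign each row to a part with one uniform draw; the last part takes the remainder."""
    rng = random.Random(seed)
    cutoffs, running = [], 0.0
    for ratio in ratios:
        running += ratio
        cutoffs.append(running)  # cumulative boundaries, e.g. 0.7 and 0.85
    parts = [[] for _ in range(len(ratios) + 1)]
    for row in rows:
        draw = rng.random()
        index = sum(draw >= cutoff for cutoff in cutoffs)  # number of boundaries passed
        parts[index].append(row)
    return parts

rows = list(range(891))  # same row count as the Titanic training file
first, second, third = split_rows(rows, ratios=[0.7, 0.15], seed=42)
print(len(first), len(second), len(third))
```

The part sizes land near 70/15/15 but change with the seed, so fixing a seed is the way to get the same split on every run.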


In [32]:
y = "Sex"
x = churn_df.columns
x.remove(y)
x.remove("PassengerId")  # Remove the row identifier
x.remove("Name")  # Remove the free-text passenger name

<div style="padding:10px;color:white;margin:0;font-size:200%;text-align:center;display:fill;border-radius:10px;background-color:#215f95;;overflow:hidden;font-weight:501;font-family:magra">Key parameters for H2O's AutoML</div>


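1. **nfolds**: 
   - **Purpose**: Number of folds for k-fold cross-validation.
   - **Default**: -1 in recent H2O releases (AutoML chooses between cross-validation and blending); older releases defaulted to 5.
   - **Usage**: Set to 0 to disable cross-validation when a validation frame is provided.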

2. **max_runtime_secs**:
   - **Purpose**: Maximum duration for the AutoML run.
   - **Default**: 0 (no limit). Dynamically set to 1 hour if both `max_runtime_secs` and `max_models` are not specified.
   
3. **max_models**:
   - **Purpose**: Maximum number of models to build, excluding Stacked Ensemble models.
   - **Default**: None.
   - **Usage**: Set this for reproducible runs: with a model count instead of a time budget, every model is trained until convergence and none is cut short by the clock.

4. **x** (optional):
   - **Purpose**: List of predictor columns.
   - **Usage**: Specify if excluding columns from prediction. If using all columns, leave unset.

5. **validation_frame** (optional):
   - **Purpose**: Validation dataset for early stopping.
   - **Usage**: Only used when `nfolds == 0`. Ignored if `nfolds > 1`.

6. **balance_classes**:
   - **Purpose**: Balance class distribution by oversampling minority classes.
   - **Default**: False.
   - **Usage**: Applicable for classification tasks.

7. **class_sampling_factors**:
   - **Purpose**: Specify per-class over/under-sampling ratios.
   - **Usage**: Requires `balance_classes` set to True.

8. **max_after_balance_size**:
   - **Purpose**: Maximum size of the training data after balancing, relative to the original size.
   - **Default**: 5.0.
   - **Usage**: Requires `balance_classes` set to True. The value can be less than 1.0.

9. **max_runtime_secs_per_model**:
   - **Purpose**: Maximum time for training each individual model.
   - **Default**: 0 (disabled).

10. **stopping_metric**:
    - **Purpose**: Metric for early stopping.
    - **Default**: AUTO (logloss for classification, deviance for regression).
    - **Options**: deviance, logloss, MSE, RMSE, MAE, RMSLE, AUC, AUCPR, lift_top_group, misclassification, mean_per_class_error.

11. **seed**:
    - **Purpose**: Seed for reproducibility.
    - **Default**: None.
    - **Usage**: For a reproducible run, combine a fixed seed with `max_models` (not a time limit) and add "DeepLearning" to `exclude_algos`, since Deep Learning models are not reproducible by default.

12. **exclude_algos**:
    - **Purpose**: Algorithms to exclude.
    - **Default**: None.
    - **Example**: ["GLM", "DeepLearning", "DRF"], etc.

13. **include_algos**:
    - **Purpose**: Algorithms to include.
    - **Default**: None.
    - **Example**: ["GLM", "DeepLearning", "DRF"], etc.
    - **Options**: `DRF` (includes both the Distributed Random Forest (DRF) and Extremely Randomized Trees (XRT) models), `GLM` (Generalized Linear Model with regularization), `XGBoost` (XGBoost GBM), `GBM` (H2O GBM), `DeepLearning` (fully-connected multi-layer artificial neural network), `StackedEnsemble` (Stacked Ensembles: one ensemble of all the base models plus ensembles built from subsets of the base models).

14. **verbosity** (optional):
    - **Purpose**: Verbosity of training messages.
    - **Default**: None.
    - **Options**: "debug", "info", "warn".

15. **y**:
    - **Purpose**: Name (or index) of the response column.

16. **training_frame**:
    - **Purpose**: The training set.
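To make the parameter list concrete, here is a minimal sketch of collecting a few of these settings before handing them to AutoML. `build_automl_settings` is a hypothetical helper (not part of H2O); it only builds a plain dictionary and checks one rule I am fairly sure of, namely that `include_algos` and `exclude_algos` should not be combined.

```python
from typing import Any, Dict, List, Optional


def build_automl_settings(
    max_models: Optional[int] = None,
    max_runtime_secs: Optional[int] = None,
    nfolds: int = 0,
    seed: Optional[int] = None,
    include_algos: Optional[List[str]] = None,
    exclude_algos: Optional[List[str]] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: collect keyword arguments for an AutoML run."""
    if include_algos and exclude_algos:
        raise ValueError("Use include_algos or exclude_algos, not both.")
    settings: Dict[str, Any] = {"nfolds": nfolds}
    optional = {
        "max_models": max_models,
        "max_runtime_secs": max_runtime_secs,
        "seed": seed,
        "include_algos": include_algos,
        "exclude_algos": exclude_algos,
    }
    # Only keep the options that were explicitly set, so library defaults apply elsewhere.
    for name, value in optional.items():
        if value is not None:
            settings[name] = value
    return settings


settings = build_automl_settings(max_models=20, seed=1, exclude_algos=["DeepLearning"])
print(settings)
```

The resulting dictionary could then be unpacked into the constructor, for example `H2OAutoML(**settings)`, with `x`, `y`, `training_frame` and `validation_frame` passed to `train()` as in the next cells.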


In [33]:
# verbosity=None disables client-side training logs ("NULL" is the R spelling)
aml = H2OAutoML(max_models=20, seed=1, verbosity=None, nfolds=0)

In [34]:
aml.train(x = x, y = y, training_frame = churn_train, validation_frame=churn_valid)

AutoML progress: |███████████████████████████████████████████████████████████████| (done) 100%


Unnamed: 0,number_of_trees
,30.0

Unnamed: 0,female,male,Error,Rate
female,163.0,56.0,0.2557,(56.0/219.0)
male,19.0,380.0,0.0476,(19.0/399.0)
Total,182.0,436.0,0.1214,(75.0/618.0)

metric,threshold,value,idx
max f1,0.382865,0.9101796,265.0
max f2,0.2309137,0.9389895,310.0
max f0point5,0.6324608,0.9114448,223.0
max accuracy,0.382865,0.8786408,265.0
max precision,0.9887664,1.0,0.0
max recall,0.1264622,1.0,352.0
max specificity,0.9887664,1.0,0.0
max absolute_mcc,0.382865,0.7310695,265.0
max min_per_class_accuracy,0.6880285,0.8584475,215.0
max mean_per_class_accuracy,0.6324608,0.8703265,223.0

group,cumulative_data_fraction,lower_threshold,lift,cumulative_lift,response_rate,score,cumulative_response_rate,cumulative_score,capture_rate,cumulative_capture_rate,gain,cumulative_gain,kolmogorov_smirnov
1,0.0194175,0.9849159,1.5488722,1.5488722,1.0,0.9859144,1.0,0.9859144,0.0300752,0.0300752,54.887218,54.887218,0.0300752
2,0.0210356,0.9842619,1.5488722,1.5488722,1.0,0.9848849,1.0,0.9858352,0.0025063,0.0325815,54.887218,54.887218,0.0325815
3,0.0323625,0.9778281,1.5488722,1.5488722,1.0,0.981058,1.0,0.9841632,0.0175439,0.0501253,54.887218,54.887218,0.0501253
4,0.0404531,0.9757569,1.5488722,1.5488722,1.0,0.9771071,1.0,0.982752,0.0125313,0.0626566,54.887218,54.887218,0.0626566
5,0.0501618,0.9747563,1.5488722,1.5488722,1.0,0.9753067,1.0,0.981311,0.0150376,0.0776942,54.887218,54.887218,0.0776942
6,0.1019417,0.9639223,1.5488722,1.5488722,1.0,0.9692769,1.0,0.9751984,0.0802005,0.1578947,54.887218,54.887218,0.1578947
7,0.1504854,0.9520301,1.5488722,1.5488722,1.0,0.9600692,1.0,0.970318,0.075188,0.2330827,54.887218,54.887218,0.2330827
8,0.2022654,0.9426107,1.5004699,1.5364812,0.96875,0.9471503,0.992,0.9643871,0.0776942,0.3107769,50.0469925,53.6481203,0.3062107
9,0.3090615,0.9133902,1.5488722,1.5407629,1.0,0.9285796,0.9947644,0.9520138,0.1654135,0.4761905,54.887218,54.0762902,0.4716243
10,0.4012945,0.8705297,1.3586598,1.4989086,0.877193,0.8903044,0.9677419,0.9378306,0.1253133,0.6015038,35.8659807,49.8908562,0.5649741

Unnamed: 0,female,male,Error,Rate
female,27.0,22.0,0.449,(22.0/49.0)
male,6.0,79.0,0.0706,(6.0/85.0)
Total,33.0,101.0,0.209,(28.0/134.0)

metric,threshold,value,idx
max f1,0.3617008,0.8494624,92.0
max f2,0.0612044,0.9042553,120.0
max f0point5,0.7601361,0.8579088,65.0
max accuracy,0.5597591,0.7985075,82.0
max precision,0.9290709,0.9736842,32.0
max recall,0.0612044,1.0,120.0
max specificity,0.9861709,0.9795918,0.0
max absolute_mcc,0.7601361,0.5695853,65.0
max min_per_class_accuracy,0.7601361,0.7529412,65.0
max mean_per_class_accuracy,0.7601361,0.7948379,65.0

group,cumulative_data_fraction,lower_threshold,lift,cumulative_lift,response_rate,score,cumulative_response_rate,cumulative_score,capture_rate,cumulative_capture_rate,gain,cumulative_gain,kolmogorov_smirnov
1,0.0149254,0.9848956,0.7882353,0.7882353,0.5,0.9855434,0.5,0.9855434,0.0117647,0.0117647,-21.1764706,-21.1764706,-0.0086435
2,0.0223881,0.979993,1.5764706,1.0509804,1.0,0.9848545,0.6666667,0.9853138,0.0117647,0.0235294,57.6470588,5.0980392,0.0031212
3,0.0373134,0.9774886,1.5764706,1.2611765,1.0,0.9774886,0.8,0.9821837,0.0235294,0.0470588,57.6470588,26.1176471,0.0266507
4,0.0447761,0.9769014,1.5764706,1.3137255,1.0,0.9769706,0.8333333,0.9813148,0.0117647,0.0588235,57.6470588,31.372549,0.0384154
5,0.0597015,0.9767544,1.5764706,1.3794118,1.0,0.9767544,0.875,0.9801747,0.0235294,0.0823529,57.6470588,37.9411765,0.0619448
6,0.1044776,0.9664846,1.5764706,1.4638655,1.0,0.9715107,0.9285714,0.9764616,0.0705882,0.1529412,57.6470588,46.3865546,0.132533
7,0.1492537,0.9565081,1.5764706,1.4976471,1.0,0.9643411,0.95,0.9728254,0.0705882,0.2235294,57.6470588,49.7647059,0.2031212
8,0.2014925,0.9431782,1.5764706,1.5180828,1.0,0.9498816,0.962963,0.966877,0.0823529,0.3058824,57.6470588,51.8082789,0.2854742
9,0.2985075,0.9280493,1.4552036,1.4976471,0.9230769,0.935408,0.95,0.9566496,0.1411765,0.4470588,45.520362,49.7647059,0.4062425
10,0.4029851,0.8761506,1.3512605,1.459695,0.8571429,0.8997861,0.9259259,0.9419072,0.1411765,0.5882353,35.1260504,45.9694989,0.5066026

Unnamed: 0,timestamp,duration,number_of_trees,training_rmse,training_logloss,training_auc,training_pr_auc,training_lift,training_classification_error,validation_rmse,validation_logloss,validation_auc,validation_pr_auc,validation_lift,validation_classification_error
,2024-07-18 19:42:51,1.265 sec,0.0,0.5,0.6931472,0.5,0.6456311,1.0,0.3543689,0.5,0.6931472,0.5,0.6343284,1.0,0.3656716
,2024-07-18 19:42:51,1.388 sec,5.0,0.3440672,0.3942783,0.9156796,0.9513902,1.5488722,0.1391586,0.3845806,0.4639985,0.8537815,0.8880945,1.2611765,0.1791045
,2024-07-18 19:42:51,1.526 sec,10.0,0.3252119,0.3485755,0.9268148,0.9588287,1.5488722,0.1294498,0.3803394,0.4530699,0.8565426,0.9003672,1.5764706,0.1716418
,2024-07-18 19:42:51,1.646 sec,15.0,0.315965,0.3287884,0.9306771,0.9601123,1.5488722,0.131068,0.3920004,0.4810211,0.8422569,0.8807458,0.7882353,0.2089552
,2024-07-18 19:42:51,1.805 sec,20.0,0.3119422,0.3218062,0.9342363,0.9625394,1.5488722,0.1245955,0.3968897,0.4941624,0.8376951,0.8700371,0.7882353,0.2089552
,2024-07-18 19:42:51,1.988 sec,25.0,0.309437,0.3164942,0.9358385,0.963128,1.5488722,0.1245955,0.3989507,0.5000192,0.8367347,0.869691,0.7882353,0.2164179
,2024-07-18 19:42:52,2.198 sec,30.0,0.307897,0.3133615,0.9377382,0.9642097,1.5488722,0.1213592,0.398048,0.4996157,0.8410564,0.8724385,0.7882353,0.2089552

variable,relative_importance,scaled_importance,percentage
Survived,390.0239563,1.0,0.4887502
Fare,109.2893524,0.2802119,0.1369536
Ticket,77.3205032,0.1982455,0.0968925
Age,65.3203812,0.1674779,0.0818548
Parch,60.5690117,0.1552956,0.0759008
Pclass,44.2675362,0.1134995,0.0554729
SibSp,20.3178234,0.0520938,0.0254608
Embarked.Q,11.8695116,0.0304328,0.014874
Embarked.S,10.1293707,0.0259712,0.0126934
Embarked.C,4.8225498,0.0123648,0.0060433
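The `lift` column in the gains/lift tables above can be checked by hand: it appears to be the response rate inside a group divided by the overall rate of the positive class (here `male`, the second level of the target). The snippet below redoes that arithmetic with counts read from the confusion matrices; the `lift` function is only an illustration of the formula.

```python
def lift(response_rate, n_positive, n_total):
    """Lift = response rate within a group / overall positive rate."""
    base_rate = n_positive / n_total
    return response_rate / base_rate


# Training frame: 399 of 618 rows are male.
train_lift_pure = lift(1.0, 399, 618)        # groups where every row is male
train_lift_group8 = lift(0.96875, 399, 618)  # group 8, response_rate 0.96875

# Validation frame: 85 of 134 rows are male.
valid_lift_pure = lift(1.0, 85, 134)

print(round(train_lift_pure, 7), round(train_lift_group8, 7), round(valid_lift_pure, 7))
```

These values agree with the 1.5488722, 1.5004699 and 1.5764706 entries in the tables. A lift of about 1.55 in the top groups means those groups are purely male, which is the most the model can achieve given that about 65% of rows are male to begin with.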


In [35]:
# Get the leaderboard; head(rows=lb.nrows) shows every model instead of the default 10 rows
# (pass extra_columns="ALL" to get_leaderboard for additional columns such as training time)
lb = h2o.automl.get_leaderboard(aml)
lb.head(rows=lb.nrows)

model_id,auc,logloss,aucpr,mean_per_class_error,rmse,mse
XGBoost_grid_1_AutoML_1_20240718_194240_model_3,0.841056,0.499616,0.872439,0.259784,0.398048,0.158442
XGBoost_grid_1_AutoML_1_20240718_194240_model_1,0.840336,0.473992,0.896505,0.233854,0.392275,0.15388
XGBoost_1_AutoML_1_20240718_194240,0.835414,0.475933,0.889137,0.222089,0.391902,0.153587
XGBoost_2_AutoML_1_20240718_194240,0.832653,0.488078,0.878162,0.211885,0.392767,0.154266
XGBoost_grid_1_AutoML_1_20240718_194240_model_2,0.823529,0.502937,0.878288,0.22521,0.402544,0.162041
GBM_4_AutoML_1_20240718_194240,0.820648,0.514175,0.873273,0.262905,0.409871,0.167994
GBM_1_AutoML_1_20240718_194240,0.819688,0.500871,0.893233,0.245618,0.406671,0.165381
GBM_3_AutoML_1_20240718_194240,0.819328,0.521342,0.866612,0.25114,0.407987,0.166454
GLM_1_AutoML_1_20240718_194240,0.818487,0.491513,0.885082,0.191837,0.399668,0.159734
GBM_grid_1_AutoML_1_20240718_194240_model_1,0.816567,0.502228,0.882468,0.277431,0.405714,0.164604
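Because `nfolds=0` and a validation frame were used, the leaderboard metrics should be computed on the validation frame. As a sanity check, the leader's `mean_per_class_error` of 0.259784 can be reproduced from the validation confusion matrix printed by `aml.train`. The helper below is a small sketch written for this note.

```python
def per_class_errors(matrix):
    """matrix maps actual label -> {predicted label: count}."""
    errors = {}
    for actual, row in matrix.items():
        total = sum(row.values())
        wrong = total - row[actual]  # everything not on the diagonal
        errors[actual] = wrong / total
    return errors


# Validation confusion matrix of the leader (rows = actual, columns = predicted).
validation_cm = {
    "female": {"female": 27, "male": 22},
    "male": {"female": 6, "male": 79},
}

errors = per_class_errors(validation_cm)
mean_per_class_error = sum(errors.values()) / len(errors)
overall_error = (22 + 6) / 134

print(errors, mean_per_class_error, overall_error)
```

The per-class errors come out at about 0.449 (female) and 0.0706 (male), and their average matches the leaderboard value. The overall error of about 0.209 is lower because the better-predicted class, male, is also the larger one.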


In [36]:
churn_pred=aml.leader.predict(churn_test)

xgboost prediction progress: |███████████████████████████████████████████████████| (done) 100%


In [37]:
churn_pred.head()

predict,female,male
male,0.606885,0.393115
female,0.783454,0.216546
male,0.0133244,0.986676
male,0.372657,0.627343
female,0.917057,0.0829433
male,0.255488,0.744512
male,0.163769,0.836231
male,0.0437453,0.956255
male,0.076695,0.923305
male,0.606885,0.393115
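Notice that the first row is labelled `male` even though the `female` probability (0.606885) is larger. As far as I know, H2O does not use a 0.5 cut-off for binary labels; it uses the threshold that maximizes F1, which for this model is 0.3617008 on the validation frame. The sketch below applies that assumed threshold to the `male` probabilities shown above.

```python
# "max f1" threshold from the validation metrics (assumed to be the one used for labels).
MAX_F1_THRESHOLD = 0.3617008


def label_from_probability(p_male, threshold=MAX_F1_THRESHOLD):
    """Label a row 'male' when its male probability reaches the threshold."""
    return "male" if p_male >= threshold else "female"


# The `male` column of churn_pred.head(), in order.
p_male_values = [
    0.393115, 0.216546, 0.986676, 0.627343, 0.0829433,
    0.744512, 0.836231, 0.956255, 0.923305, 0.393115,
]

labels = [label_from_probability(p) for p in p_male_values]
print(labels)
```

The labels produced this way agree with the `predict` column for all ten rows, which is consistent with the max-F1 threshold being applied.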


In [38]:
model_ids = list(aml.leaderboard['model_id'].as_data_frame().iloc[:,0])
#se = h2o.get_model([mid for mid in model_ids if "StackedEnsemble_AllModels" in mid][0])
#metalearner = h2o.get_model(se.metalearner()['name'])
model_ids



['XGBoost_grid_1_AutoML_1_20240718_194240_model_3',
 'XGBoost_grid_1_AutoML_1_20240718_194240_model_1',
 'XGBoost_1_AutoML_1_20240718_194240',
 'XGBoost_2_AutoML_1_20240718_194240',
 'XGBoost_grid_1_AutoML_1_20240718_194240_model_2',
 'GBM_4_AutoML_1_20240718_194240',
 'GBM_1_AutoML_1_20240718_194240',
 'GBM_3_AutoML_1_20240718_194240',
 'GLM_1_AutoML_1_20240718_194240',
 'GBM_grid_1_AutoML_1_20240718_194240_model_1',
 'DeepLearning_grid_2_AutoML_1_20240718_194240_model_1',
 'XGBoost_3_AutoML_1_20240718_194240',
 'GBM_5_AutoML_1_20240718_194240',
 'DeepLearning_1_AutoML_1_20240718_194240',
 'GBM_2_AutoML_1_20240718_194240',
 'DRF_1_AutoML_1_20240718_194240',
 'DeepLearning_grid_1_AutoML_1_20240718_194240_model_1',
 'DeepLearning_grid_3_AutoML_1_20240718_194240_model_1',
 'GBM_grid_1_AutoML_1_20240718_194240_model_2',
 'XRT_1_AutoML_1_20240718_194240']

In [39]:
out=h2o.get_model([mid for mid in model_ids if "XGBoost" in mid][0])
out

Unnamed: 0,number_of_trees
,30.0

Unnamed: 0,female,male,Error,Rate
female,163.0,56.0,0.2557,(56.0/219.0)
male,19.0,380.0,0.0476,(19.0/399.0)
Total,182.0,436.0,0.1214,(75.0/618.0)

metric,threshold,value,idx
max f1,0.382865,0.9101796,265.0
max f2,0.2309137,0.9389895,310.0
max f0point5,0.6324608,0.9114448,223.0
max accuracy,0.382865,0.8786408,265.0
max precision,0.9887664,1.0,0.0
max recall,0.1264622,1.0,352.0
max specificity,0.9887664,1.0,0.0
max absolute_mcc,0.382865,0.7310695,265.0
max min_per_class_accuracy,0.6880285,0.8584475,215.0
max mean_per_class_accuracy,0.6324608,0.8703265,223.0

group,cumulative_data_fraction,lower_threshold,lift,cumulative_lift,response_rate,score,cumulative_response_rate,cumulative_score,capture_rate,cumulative_capture_rate,gain,cumulative_gain,kolmogorov_smirnov
1,0.0194175,0.9849159,1.5488722,1.5488722,1.0,0.9859144,1.0,0.9859144,0.0300752,0.0300752,54.887218,54.887218,0.0300752
2,0.0210356,0.9842619,1.5488722,1.5488722,1.0,0.9848849,1.0,0.9858352,0.0025063,0.0325815,54.887218,54.887218,0.0325815
3,0.0323625,0.9778281,1.5488722,1.5488722,1.0,0.981058,1.0,0.9841632,0.0175439,0.0501253,54.887218,54.887218,0.0501253
4,0.0404531,0.9757569,1.5488722,1.5488722,1.0,0.9771071,1.0,0.982752,0.0125313,0.0626566,54.887218,54.887218,0.0626566
5,0.0501618,0.9747563,1.5488722,1.5488722,1.0,0.9753067,1.0,0.981311,0.0150376,0.0776942,54.887218,54.887218,0.0776942
6,0.1019417,0.9639223,1.5488722,1.5488722,1.0,0.9692769,1.0,0.9751984,0.0802005,0.1578947,54.887218,54.887218,0.1578947
7,0.1504854,0.9520301,1.5488722,1.5488722,1.0,0.9600692,1.0,0.970318,0.075188,0.2330827,54.887218,54.887218,0.2330827
8,0.2022654,0.9426107,1.5004699,1.5364812,0.96875,0.9471503,0.992,0.9643871,0.0776942,0.3107769,50.0469925,53.6481203,0.3062107
9,0.3090615,0.9133902,1.5488722,1.5407629,1.0,0.9285796,0.9947644,0.9520138,0.1654135,0.4761905,54.887218,54.0762902,0.4716243
10,0.4012945,0.8705297,1.3586598,1.4989086,0.877193,0.8903044,0.9677419,0.9378306,0.1253133,0.6015038,35.8659807,49.8908562,0.5649741

Unnamed: 0,female,male,Error,Rate
female,27.0,22.0,0.449,(22.0/49.0)
male,6.0,79.0,0.0706,(6.0/85.0)
Total,33.0,101.0,0.209,(28.0/134.0)

metric,threshold,value,idx
max f1,0.3617008,0.8494624,92.0
max f2,0.0612044,0.9042553,120.0
max f0point5,0.7601361,0.8579088,65.0
max accuracy,0.5597591,0.7985075,82.0
max precision,0.9290709,0.9736842,32.0
max recall,0.0612044,1.0,120.0
max specificity,0.9861709,0.9795918,0.0
max absolute_mcc,0.7601361,0.5695853,65.0
max min_per_class_accuracy,0.7601361,0.7529412,65.0
max mean_per_class_accuracy,0.7601361,0.7948379,65.0

group,cumulative_data_fraction,lower_threshold,lift,cumulative_lift,response_rate,score,cumulative_response_rate,cumulative_score,capture_rate,cumulative_capture_rate,gain,cumulative_gain,kolmogorov_smirnov
1,0.0149254,0.9848956,0.7882353,0.7882353,0.5,0.9855434,0.5,0.9855434,0.0117647,0.0117647,-21.1764706,-21.1764706,-0.0086435
2,0.0223881,0.979993,1.5764706,1.0509804,1.0,0.9848545,0.6666667,0.9853138,0.0117647,0.0235294,57.6470588,5.0980392,0.0031212
3,0.0373134,0.9774886,1.5764706,1.2611765,1.0,0.9774886,0.8,0.9821837,0.0235294,0.0470588,57.6470588,26.1176471,0.0266507
4,0.0447761,0.9769014,1.5764706,1.3137255,1.0,0.9769706,0.8333333,0.9813148,0.0117647,0.0588235,57.6470588,31.372549,0.0384154
5,0.0597015,0.9767544,1.5764706,1.3794118,1.0,0.9767544,0.875,0.9801747,0.0235294,0.0823529,57.6470588,37.9411765,0.0619448
6,0.1044776,0.9664846,1.5764706,1.4638655,1.0,0.9715107,0.9285714,0.9764616,0.0705882,0.1529412,57.6470588,46.3865546,0.132533
7,0.1492537,0.9565081,1.5764706,1.4976471,1.0,0.9643411,0.95,0.9728254,0.0705882,0.2235294,57.6470588,49.7647059,0.2031212
8,0.2014925,0.9431782,1.5764706,1.5180828,1.0,0.9498816,0.962963,0.966877,0.0823529,0.3058824,57.6470588,51.8082789,0.2854742
9,0.2985075,0.9280493,1.4552036,1.4976471,0.9230769,0.935408,0.95,0.9566496,0.1411765,0.4470588,45.520362,49.7647059,0.4062425
10,0.4029851,0.8761506,1.3512605,1.459695,0.8571429,0.8997861,0.9259259,0.9419072,0.1411765,0.5882353,35.1260504,45.9694989,0.5066026

Unnamed: 0,timestamp,duration,number_of_trees,training_rmse,training_logloss,training_auc,training_pr_auc,training_lift,training_classification_error,validation_rmse,validation_logloss,validation_auc,validation_pr_auc,validation_lift,validation_classification_error
,2024-07-18 19:42:51,1.265 sec,0.0,0.5,0.6931472,0.5,0.6456311,1.0,0.3543689,0.5,0.6931472,0.5,0.6343284,1.0,0.3656716
,2024-07-18 19:42:51,1.388 sec,5.0,0.3440672,0.3942783,0.9156796,0.9513902,1.5488722,0.1391586,0.3845806,0.4639985,0.8537815,0.8880945,1.2611765,0.1791045
,2024-07-18 19:42:51,1.526 sec,10.0,0.3252119,0.3485755,0.9268148,0.9588287,1.5488722,0.1294498,0.3803394,0.4530699,0.8565426,0.9003672,1.5764706,0.1716418
,2024-07-18 19:42:51,1.646 sec,15.0,0.315965,0.3287884,0.9306771,0.9601123,1.5488722,0.131068,0.3920004,0.4810211,0.8422569,0.8807458,0.7882353,0.2089552
,2024-07-18 19:42:51,1.805 sec,20.0,0.3119422,0.3218062,0.9342363,0.9625394,1.5488722,0.1245955,0.3968897,0.4941624,0.8376951,0.8700371,0.7882353,0.2089552
,2024-07-18 19:42:51,1.988 sec,25.0,0.309437,0.3164942,0.9358385,0.963128,1.5488722,0.1245955,0.3989507,0.5000192,0.8367347,0.869691,0.7882353,0.2164179
,2024-07-18 19:42:52,2.198 sec,30.0,0.307897,0.3133615,0.9377382,0.9642097,1.5488722,0.1213592,0.398048,0.4996157,0.8410564,0.8724385,0.7882353,0.2089552

variable,relative_importance,scaled_importance,percentage
Survived,390.0239563,1.0,0.4887502
Fare,109.2893524,0.2802119,0.1369536
Ticket,77.3205032,0.1982455,0.0968925
Age,65.3203812,0.1674779,0.0818548
Parch,60.5690117,0.1552956,0.0759008
Pclass,44.2675362,0.1134995,0.0554729
SibSp,20.3178234,0.0520938,0.0254608
Embarked.Q,11.8695116,0.0304328,0.014874
Embarked.S,10.1293707,0.0259712,0.0126934
Embarked.C,4.8225498,0.0123648,0.0060433


In [40]:
out.params

{'model_id': {'default': None,
  'actual': {'__meta': {'schema_version': 3,
    'schema_name': 'ModelKeyV3',
    'schema_type': 'Key<Model>'},
   'name': 'XGBoost_grid_1_AutoML_1_20240718_194240_model_3',
   'type': 'Key<Model>',
   'URL': '/3/Models/XGBoost_grid_1_AutoML_1_20240718_194240_model_3'},
  'input': None},
 'training_frame': {'default': None,
  'actual': {'__meta': {'schema_version': 3,
    'schema_name': 'FrameKeyV3',
    'schema_type': 'Key<Frame>'},
   'name': 'AutoML_1_20240718_194240_training_py_2_sid_8601',
   'type': 'Key<Frame>',
   'URL': '/3/Frames/AutoML_1_20240718_194240_training_py_2_sid_8601'},
  'input': {'__meta': {'schema_version': 3,
    'schema_name': 'FrameKeyV3',
    'schema_type': 'Key<Frame>'},
   'name': 'AutoML_1_20240718_194240_training_py_2_sid_8601',
   'type': 'Key<Frame>',
   'URL': '/3/Frames/AutoML_1_20240718_194240_training_py_2_sid_8601'}},
 'validation_frame': {'default': None,
  'actual': {'__meta': {'schema_version': 3,
    'schema_name'

### The output above contains many H2O-specific XGBoost parameters; they can be converted to native XGBoost parameters with

In [41]:
out.convert_H2OXGBoostParams_2_XGBoostParams()

({'min_child_weight': 5.0,
  'normalize_type': 'tree',
  'eta': 0.3,
  'objective': 'binary:logistic',
  'silent': True,
  'nthread': 4,
  'seed': 3,
  'colsample_bylevel': 1.0,
  'rate_drop': 0.0,
  'max_bin': 256,
  'one_drop': '0',
  'sample_type': 'uniform',
  'max_depth': 6,
  'lambda': 0.1,
  'colsample_bytree': 1.0,
  'gamma': 0.0,
  'gpu_id': 0,
  'alpha': 0.001,
  'booster': 'dart',
  'grow_policy': 'depthwise',
  'skip_drop': 0.0,
  'nround': 10000,
  'max_delta_step': 0.0,
  'subsample': 1.0,
  'tree_method': 'gpu_hist'},
 30)
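The result is a tuple: a dictionary of native XGBoost parameters followed by the number of trees (30 here). The sketch below shows one way such a tuple might be adapted for a CPU-only machine. It works on a hand-copied subset of the dictionary above, and `to_cpu_params` is a hypothetical helper; dropping `gpu_id` and `nround` and switching `tree_method` to `"hist"` are my assumptions about what a CPU run would need, not something the notebook verifies.

```python
# Hand-copied subset of the tuple returned above (params dict, number of trees).
h2o_converted = (
    {
        "eta": 0.3,
        "max_depth": 6,
        "booster": "dart",
        "objective": "binary:logistic",
        "tree_method": "gpu_hist",
        "gpu_id": 0,
        "nround": 10000,
    },
    30,
)


def to_cpu_params(converted):
    """Hypothetical helper: return (params, n_trees) adjusted for CPU training."""
    params, n_trees = converted
    # Assumption: GPU-specific keys and the round count do not belong in the params dict.
    cpu_params = {k: v for k, v in params.items() if k not in ("gpu_id", "nround")}
    cpu_params["tree_method"] = "hist"  # CPU histogram algorithm
    return cpu_params, n_trees


cpu_params, n_trees = to_cpu_params(h2o_converted)
print(cpu_params, n_trees)
```

With the native `xgboost` package installed, the pair could then be passed along the lines of `xgb.train(cpu_params, dtrain, num_boost_round=n_trees)`, where `dtrain` is a `DMatrix` built from the same features.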

In [42]:
out.varimp_plot()

<h2o.plot._plot_result._MObject at 0x7babe008d840>

In [43]:
# to check feature importance
out.varimp_plot()

<h2o.plot._plot_result._MObject at 0x7bab6a076c80>

### I don't know why the feature importance plot is not being displayed

In [44]:
out2=h2o.get_model([mid for mid in model_ids if "DeepLearning" in mid][0])
out2

Unnamed: 0,layer,units,type,dropout,l1,l2,mean_rate,rate_rms,momentum,mean_weight,weight_rms,mean_bias,bias_rms
,1,159,Input,15.0,,,,,,,,,
,2,100,RectifierDropout,10.0,0.0,0.0,0.2461594,0.4250294,0.0,-0.000238,0.0912909,0.468968,0.1004157
,3,100,RectifierDropout,10.0,0.0,0.0,0.016449,0.0770666,0.0,-0.0069629,0.1007419,0.9795329,0.020421
,4,2,Softmax,,0.0,0.0,0.0010504,0.0007491,0.0,-0.0076521,0.5481899,8.7e-06,0.0018729

Unnamed: 0,female,male,Error,Rate
female,147.0,72.0,0.3288,(72.0/219.0)
male,56.0,343.0,0.1404,(56.0/399.0)
Total,203.0,415.0,0.2071,(128.0/618.0)

metric,threshold,value,idx
max f1,0.2585055,0.8427518,264.0
max f2,0.0198022,0.9047619,393.0
max f0point5,0.5837059,0.8749266,193.0
max accuracy,0.4358168,0.7993528,212.0
max precision,0.9923852,1.0,0.0
max recall,0.0198022,1.0,393.0
max specificity,0.9923852,1.0,0.0
max absolute_mcc,0.5837059,0.5930731,193.0
max min_per_class_accuracy,0.3823616,0.7945205,221.0
max mean_per_class_accuracy,0.5837059,0.8095066,193.0

group,cumulative_data_fraction,lower_threshold,lift,cumulative_lift,response_rate,score,cumulative_response_rate,cumulative_score,capture_rate,cumulative_capture_rate,gain,cumulative_gain,kolmogorov_smirnov
1,0.0113269,0.984581,1.5488722,1.5488722,1.0,0.9899339,1.0,0.9899339,0.0175439,0.0175439,54.887218,54.887218,0.0175439
2,0.0210356,0.9811687,1.5488722,1.5488722,1.0,0.9819674,1.0,0.9862571,0.0150376,0.0325815,54.887218,54.887218,0.0325815
3,0.0307443,0.9774548,1.5488722,1.5488722,1.0,0.9795208,1.0,0.9841298,0.0150376,0.047619,54.887218,54.887218,0.047619
4,0.0404531,0.9731623,1.5488722,1.5488722,1.0,0.975589,1.0,0.98208,0.0150376,0.0626566,54.887218,54.887218,0.0626566
5,0.0501618,0.9706761,1.5488722,1.5488722,1.0,0.9720808,1.0,0.9801447,0.0150376,0.0776942,54.887218,54.887218,0.0776942
6,0.1003236,0.948092,1.4989086,1.5238904,0.9677419,0.9590707,0.983871,0.9696077,0.075188,0.1528822,49.8908562,52.3890371,0.148316
7,0.1504854,0.9303952,1.4989086,1.5155631,0.9677419,0.9393749,0.9784946,0.9595301,0.075188,0.2280702,49.8908562,51.5563101,0.2189378
8,0.2006472,0.9045009,1.4989086,1.5113995,0.9677419,0.9180378,0.9758065,0.949157,0.075188,0.3032581,49.8908562,51.1399466,0.2895595
9,0.3009709,0.8740665,1.3989813,1.4739268,0.9032258,0.8913787,0.9516129,0.9298976,0.1403509,0.443609,39.8981324,47.3926752,0.4025131
10,0.3996764,0.8240715,1.4219154,1.4610819,0.9180328,0.8495129,0.9433198,0.9100455,0.1403509,0.5839599,42.1915444,46.1081854,0.520033

Unnamed: 0,female,male,Error,Rate
female,34.0,15.0,0.3061,(15.0/49.0)
male,11.0,74.0,0.1294,(11.0/85.0)
Total,45.0,89.0,0.194,(26.0/134.0)

metric,threshold,value,idx
max f1,0.2347433,0.8505747,88.0
max f2,0.0176995,0.9004237,131.0
max f0point5,0.4025472,0.8651399,76.0
max accuracy,0.4025472,0.8059701,76.0
max precision,0.9861867,1.0,0.0
max recall,0.0176995,1.0,131.0
max specificity,0.9861867,1.0,0.0
max absolute_mcc,0.4025472,0.6003929,76.0
max min_per_class_accuracy,0.4025472,0.8,76.0
max mean_per_class_accuracy,0.4025472,0.8081633,76.0

group,cumulative_data_fraction,lower_threshold,lift,cumulative_lift,response_rate,score,cumulative_response_rate,cumulative_score,capture_rate,cumulative_capture_rate,gain,cumulative_gain,kolmogorov_smirnov
1,0.0149254,0.9799607,1.5764706,1.5764706,1.0,0.9842475,1.0,0.9842475,0.0235294,0.0235294,57.6470588,57.6470588,0.0235294
2,0.0223881,0.9744337,0.0,1.0509804,0.0,0.9751941,0.6666667,0.9812297,0.0,0.0235294,-100.0,5.0980392,0.0031212
3,0.0298507,0.9719707,1.5764706,1.1823529,1.0,0.974042,0.75,0.9794328,0.0117647,0.0352941,57.6470588,18.2352941,0.014886
4,0.0447761,0.9655961,1.5764706,1.3137255,1.0,0.968783,0.8333333,0.9758829,0.0235294,0.0588235,57.6470588,31.372549,0.0384154
5,0.0522388,0.9619607,1.5764706,1.3512605,1.0,0.9655532,0.8571429,0.9744072,0.0117647,0.0705882,57.6470588,35.1260504,0.0501801
6,0.1044776,0.9400997,1.5764706,1.4638655,1.0,0.9484074,0.9285714,0.9614073,0.0823529,0.1529412,57.6470588,46.3865546,0.132533
7,0.1492537,0.9226551,1.3137255,1.4188235,0.8333333,0.9340412,0.9,0.9531975,0.0588235,0.2117647,31.372549,41.8823529,0.1709484
8,0.2014925,0.8975545,1.5764706,1.459695,1.0,0.9088962,0.9259259,0.941712,0.0823529,0.2941176,57.6470588,45.9694989,0.2533013
9,0.2985075,0.8812297,1.4552036,1.4582353,0.9230769,0.8909775,0.925,0.9252233,0.1411765,0.4352941,45.520362,45.8235294,0.3740696
10,0.4029851,0.8023986,1.3512605,1.4305011,0.8571429,0.8426318,0.9074074,0.9038107,0.1411765,0.5764706,35.1260504,43.0501089,0.4744298

Unnamed: 0,timestamp,duration,training_speed,epochs,iterations,samples,training_rmse,training_logloss,training_r2,training_auc,training_pr_auc,training_lift,training_classification_error,validation_rmse,validation_logloss,validation_r2,validation_auc,validation_pr_auc,validation_lift,validation_classification_error
,2024-07-18 19:43:19,0.000 sec,,0.0,0,0.0,,,,,,,,,,,,,,
,2024-07-18 19:43:19,0.865 sec,7754 obs/sec,10.0,1,6180.0,0.394559,0.4929237,0.3195695,0.8607993,0.9206503,1.5488722,0.2071197,0.4114906,0.5676391,0.2700141,0.8156062,0.8659782,1.5764706,0.1940299
,2024-07-18 19:43:24,5.902 sec,12801 obs/sec,120.0,12,74160.0,0.3269906,0.3504212,0.5326626,0.9239823,0.9522247,1.5488722,0.131068,0.4055799,0.6344493,0.2908349,0.7923169,0.805059,0.7882353,0.1865672
,2024-07-18 19:43:29,10.924 sec,14912 obs/sec,260.0,26,160680.0,0.2965913,0.2893881,0.6155173,0.9414919,0.9642595,1.5488722,0.1116505,0.4077858,0.6349575,0.2830997,0.7985594,0.8139627,0.7882353,0.2089552
,2024-07-18 19:43:35,16.097 sec,17872 obs/sec,460.0,46,284280.0,0.2885151,0.2766559,0.6361711,0.9483927,0.96829,1.5488722,0.1035599,0.4182311,0.6986849,0.245903,0.7877551,0.8262505,1.5764706,0.2014925
,2024-07-18 19:43:40,21.132 sec,20104 obs/sec,680.0,68,420240.0,0.2955911,0.2839439,0.6181062,0.9515627,0.9708333,1.5488722,0.1019417,0.4289938,0.7778733,0.2065918,0.752461,0.7903404,1.5764706,0.2089552
,2024-07-18 19:43:45,26.199 sec,22165 obs/sec,930.0,93,574740.0,0.2882749,0.2674542,0.6367768,0.953594,0.9725504,1.5488722,0.105178,0.4362241,0.8310858,0.1796221,0.7342137,0.7835533,1.5764706,0.2164179
,2024-07-18 19:43:45,26.250 sec,22156 obs/sec,930.0,93,574740.0,0.394559,0.4929237,0.3195695,0.8607993,0.9206503,1.5488722,0.2071197,0.4114906,0.5676391,0.2700141,0.8156062,0.8659782,1.5764706,0.1940299

variable,relative_importance,scaled_importance,percentage
Cabin.C52,1.0,1.0,0.0069422
Cabin.B79,0.9992890,0.9992890,0.0069372
Cabin.E46,0.9979272,0.9979272,0.0069278
Cabin.E58,0.9902442,0.9902442,0.0068744
Cabin.D9,0.9876831,0.9876831,0.0068566
Cabin.B73,0.9863540,0.9863540,0.0068474
Cabin.E101,0.9806820,0.9806820,0.0068080
Cabin.B69,0.9803655,0.9803655,0.0068058
Cabin.F G63,0.9791802,0.9791802,0.0067976
Cabin.B4,0.9777629,0.9777629,0.0067878


In [45]:
out2.params

{'model_id': {'default': None,
  'actual': {'__meta': {'schema_version': 3,
    'schema_name': 'ModelKeyV3',
    'schema_type': 'Key<Model>'},
   'name': 'DeepLearning_grid_2_AutoML_1_20240718_194240_model_1',
   'type': 'Key<Model>',
   'URL': '/3/Models/DeepLearning_grid_2_AutoML_1_20240718_194240_model_1'},
  'input': None},
 'training_frame': {'default': None,
  'actual': {'__meta': {'schema_version': 3,
    'schema_name': 'FrameKeyV3',
    'schema_type': 'Key<Frame>'},
   'name': 'AutoML_1_20240718_194240_training_py_2_sid_8601',
   'type': 'Key<Frame>',
   'URL': '/3/Frames/AutoML_1_20240718_194240_training_py_2_sid_8601'},
  'input': {'__meta': {'schema_version': 3,
    'schema_name': 'FrameKeyV3',
    'schema_type': 'Key<Frame>'},
   'name': 'AutoML_1_20240718_194240_training_py_2_sid_8601',
   'type': 'Key<Frame>',
   'URL': '/3/Frames/AutoML_1_20240718_194240_training_py_2_sid_8601'}},
 'validation_frame': {'default': None,
  'actual': {'__meta': {'schema_version': 3,
    'sc

In [46]:
# Extracting Deep Learning params is harder: there is no built-in converter as there is for XGBoost, so we write our own
def convert_H2ODeepLearningParams_2_DeepLearningParams(h2o_params):
    dl_params = {}

    # Map H2O Deep Learning parameters to general Deep Learning framework parameters
    if 'activation' in h2o_params:
        dl_params['activation'] = h2o_params['activation']

    if 'hidden' in h2o_params:
        dl_params['hidden_layers'] = h2o_params['hidden']

    if 'epochs' in h2o_params:
        dl_params['epochs'] = h2o_params['epochs']

    if 'train_samples_per_iteration' in h2o_params:
        dl_params['batch_size'] = h2o_params['train_samples_per_iteration']

    if 'rate' in h2o_params:
        dl_params['learning_rate'] = h2o_params['rate']

    if 'rate_annealing' in h2o_params:
        dl_params['learning_rate_decay'] = h2o_params['rate_annealing']

    if 'l1' in h2o_params:
        dl_params['l1_regularization'] = h2o_params['l1']

    if 'l2' in h2o_params:
        dl_params['l2_regularization'] = h2o_params['l2']

    if 'momentum_stable' in h2o_params:
        dl_params['momentum'] = h2o_params['momentum_stable']

    if 'rho' in h2o_params:
        dl_params['rho'] = h2o_params['rho']

    if 'epsilon' in h2o_params:
        dl_params['epsilon'] = h2o_params['epsilon']

    if 'max_w2' in h2o_params:
        dl_params['max_weight'] = h2o_params['max_w2']

    if 'initial_weight_distribution' in h2o_params:
        dl_params['weight_init'] = h2o_params['initial_weight_distribution']

    if 'initial_weight_scale' in h2o_params:
        dl_params['weight_scale'] = h2o_params['initial_weight_scale']

    if 'loss' in h2o_params:
        dl_params['loss_function'] = h2o_params['loss']

    if 'stopping_metric' in h2o_params:
        dl_params['early_stopping_metric'] = h2o_params['stopping_metric']

    if 'stopping_rounds' in h2o_params:
        dl_params['early_stopping_rounds'] = h2o_params['stopping_rounds']

    if 'stopping_tolerance' in h2o_params:
        dl_params['early_stopping_tolerance'] = h2o_params['stopping_tolerance']

    if 'score_interval' in h2o_params:
        dl_params['validation_freq'] = h2o_params['score_interval']

    # Return the converted parameters as a DataFrame
    return pd.DataFrame.from_dict(dl_params, orient='index')


converted_params = convert_H2ODeepLearningParams_2_DeepLearningParams(out2.params)
converted_params

Unnamed: 0,default,actual,input
activation,Rectifier,RectifierWithDropout,RectifierWithDropout
hidden_layers,"[200, 200]","[100, 100]","[100, 100]"
epochs,10.00,10000.00,10000.00
batch_size,-2,-2,-2
learning_rate,0.01,0.01,0.01
learning_rate_decay,0.00,0.00,0.00
l1_regularization,0.00,0.00,0.00
l2_regularization,0.00,0.00,0.00
momentum,0.00,0.00,0.00
rho,0.99,0.90,0.90
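Each entry of `params` is a small dictionary with `default`, `actual` and `input` keys, which is why the table above has three columns. A quick way to see what AutoML actually changed is to keep only the entries whose `actual` value differs from the default. The sketch below does this on a hand-copied sample of the values shown above; `changed_from_default` is a hypothetical helper.

```python
# Hand-copied sample of out2.params, using H2O's own parameter names.
sample_params = {
    "activation": {"default": "Rectifier", "actual": "RectifierWithDropout", "input": "RectifierWithDropout"},
    "hidden": {"default": [200, 200], "actual": [100, 100], "input": [100, 100]},
    "epochs": {"default": 10.0, "actual": 10000.0, "input": 10000.0},
    "rate": {"default": 0.01, "actual": 0.01, "input": 0.01},
    "rho": {"default": 0.99, "actual": 0.9, "input": 0.9},
}


def changed_from_default(params):
    """Return {name: actual value} for parameters that differ from their default."""
    return {
        name: entry["actual"]
        for name, entry in params.items()
        if entry["actual"] != entry["default"]
    }


changed = changed_from_default(sample_params)
print(changed)
```

On the real model the same function could be called as `changed_from_default(out2.params)`, assuming every entry carries both keys. I believe H2O models also expose an `actual_params` attribute that gives the actual values directly.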


In [47]:
out_gbm = h2o.get_model([mid for mid in model_ids if "GBM" in mid][0])
out_gbm.confusion_matrix()

Unnamed: 0,female,male,Error,Rate
female,191.0,28.0,0.1279,(28.0/219.0)
male,11.0,388.0,0.0276,(11.0/399.0)
Total,202.0,416.0,0.0631,(39.0/618.0)




<a id="1"></a>
# <div style="text-align:center; border-radius:15px 50px; padding:7px; color:white; margin:0; font-size:110%; font-family:Pacifico; background-color:#0f7a7a; overflow:hidden"><b>H2O integration with Sklearn</b></div>

<div style="padding:10px;color:white;margin:0;font-size:200%;text-align:center;display:fill;border-radius:10px;background-color:#215f95;;overflow:hidden;font-weight:501;font-family:magra">h2o.sklearn</div>



The `h2o.sklearn` module introduces a new approach while maintaining compatibility with existing estimators and transformers. Instead of modifying the original `h2o.estimators` and `h2o.transforms`, this module offers autogenerated wrappers layered atop them, including H2OAutoML.

These wrappers are designed to cater to a wide range of use-cases for integrating H2O-3 with sklearn:

- They adopt sklearn's naming conventions, such as `H2OGradientBoostingClassifier`, `H2OAutoMLClassifier` for classifiers (built on top of `H2OGradientBoostingEstimator`, `H2OAutoML`). The classifier wrappers ensure that the target column is automatically treated as categorical.
  
- For regressors, names like `H2OGradientBoostingRegressor`, `H2OAutoMLRegressor` are used (based on `H2OGradientBoostingEstimator`, `H2OAutoML`).
  
- Generic estimators are exposed through names like `H2OGradientBoostingEstimator`, `H2OAutoMLEstimator`, accepting an `estimator_type` parameter (`None`, `'classifier'`, or `'regressor'`).

- The wrappers exclusively expose a sklearn-like API, featuring:
  - Constructors accepting all parameters as keyword arguments, facilitating auto-completion in environments like Jupyter notebooks.
  - `get_params()` and `set_params(**params)`, with `get_params()` returning all possible parameters, not just those explicitly set.
  - Methods like `fit(X, y)`, `predict(X)`, `fit_predict(X, y)` for all estimators (including `H2OAutoML`).
  - `predict_proba(X)`, `predict_log_proba(X)` for classifiers supporting probability predictions.
  - Transformation methods like `transform(X)`, `fit_transform(X, y)`, `inverse_transform(X)` for transformers and estimators that support transformations (e.g., `H2OPrincipalComponentAnalysisEstimator`).
  - `score(X, y)` for estimators, using sklearn's metrics such as `sklearn.metrics.accuracy_score` for classifiers and `sklearn.metrics.r2_score` for regressors.
  - An `estimator` property (available since 3.28.0.2; as `_estimator` in 3.28.0.1), providing a reference to the original `H2OEstimator` or `H2OAutoML` instance, allowing access to additional properties and methods.
  
- `X` and `y` parameters accept various data types (`H2OFrame`, `numpy` arrays, `pandas.DataFrame`). The wrappers attempt to return predictions or transformations in the same format as the input, facilitating seamless integration with sklearn components in `sklearn.pipeline.Pipeline`.

- For ease of use, the wrappers can handle automatic connection management to the local backend, including auto-start, auto-connect, and auto-shutdown when the wrapper is no longer in use. Note that this automatic connection management is disabled if the user initiates a connection using `h2o.init()` beforehand.

In [48]:
import warnings
warnings.filterwarnings(action='ignore')
import numpy as np
import pandas as pd  
# pandas >= 0.19.2 required
train = pd.read_csv("https://h2o-public-test-data.s3.amazonaws.com/smalldata/iris/iris_train.csv").values
test = pd.read_csv("https://h2o-public-test-data.s3.amazonaws.com/smalldata/iris/iris_test.csv").values
X_train = train[:,:-1]
y_train = train[:,-1]
X_test = test[:,:-1]
y_test = test[:,-1]
X_train[:10], y_train[:10]

(array([[6.0, 2.2, 4.0, 1.0],
        [5.2, 3.4, 1.4, 0.2],
        [6.9, 3.1, 5.4, 2.1],
        [7.3, 2.9, 6.3, 1.8],
        [7.6, 3.0, 6.6, 2.1],
        [5.6, 3.0, 4.5, 1.5],
        [5.4, 3.4, 1.7, 0.2],
        [6.4, 3.2, 5.3, 2.3],
        [4.5, 2.3, 1.3, 0.3],
        [6.2, 3.4, 5.4, 2.3]], dtype=object),
 array(['Iris-versicolor', 'Iris-setosa', 'Iris-virginica',
        'Iris-virginica', 'Iris-virginica', 'Iris-versicolor',
        'Iris-setosa', 'Iris-virginica', 'Iris-setosa', 'Iris-virginica'],
       dtype=object))

In [49]:
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

from h2o.sklearn import H2OGradientBoostingClassifier

seed = 42

pipeline_mix = Pipeline([
    ("standardize", StandardScaler()),
    ("pca", PCA(n_components=2, random_state=seed)),
    ("classifier", H2OGradientBoostingClassifier(seed=seed))
])

In [50]:
assert 'learn_rate' in pipeline_mix.named_steps.classifier.get_params()
pipeline_mix.set_params(classifier__learn_rate=0.01)
assert pipeline_mix.named_steps.classifier.learn_rate == 0.01

Now that our pipeline is defined, we can train the model. Because we haven't initialized H2O-3 yet (normally done with `h2o.init()`), it is started automatically by the first H2O component encountered in the pipeline.
Note also the progress bars, which show the numpy training data being converted and uploaded to the H2O backend. If you want to hide them, initialize H2O using `h2o.init(show_progress=False)`. This can also be done directly in the `h2o.sklearn` wrapper, for example with `H2OGradientBoostingClassifier(seed=seed, init_connection_args=dict(show_progress=False))`.

In [51]:
pipeline_mix.fit(X_train, y_train)

Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
gbm Model Build progress: |██████████████████████████████████████████████████████| (done) 100%


In [52]:
preds = pipeline_mix.predict(X_test)
assert isinstance(preds, np.ndarray)

Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
gbm prediction progress: |███████████████████████████████████████████████████████| (done) 100%


In [53]:
# get accuracy score (automatically calls `predict` on the estimator internally)
pipeline_mix.score(X_test, y_test)

Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
gbm prediction progress: |███████████████████████████████████████████████████████| (done) 100%


0.94

In [54]:
gbm_wrapper = pipeline_mix.named_steps.classifier
gbm_wrapper.estimator  # use gbm_wrapper._estimator in 3.28.0.1

Unnamed: 0,number_of_trees,number_of_internal_trees,model_size_in_bytes,min_depth,max_depth,mean_depth,min_leaves,max_leaves,mean_leaves
,50.0,150.0,17537.0,1.0,4.0,3.6133332,2.0,6.0,4.6266665

Unnamed: 0,Iris-setosa,Iris-versicolor,Iris-virginica,Error,Rate
Iris-setosa,30.0,0.0,0.0,0.0,0 / 30
Iris-versicolor,0.0,31.0,3.0,0.0882353,3 / 34
Iris-virginica,0.0,4.0,32.0,0.1111111,4 / 36
Total,30.0,35.0,35.0,0.07,7 / 100

k,hit_ratio
1,0.93
2,1.0
3,1.0

Unnamed: 0,timestamp,duration,number_of_trees,training_rmse,training_logloss,training_classification_error,training_auc,training_pr_auc
,2024-07-18 19:44:17,0.002 sec,0.0,0.6666667,1.0986123,0.72,,
,2024-07-18 19:44:17,0.023 sec,1.0,0.6608931,1.0814479,0.07,,
,2024-07-18 19:44:17,0.027 sec,2.0,0.6551584,1.0647012,0.07,,
,2024-07-18 19:44:17,0.031 sec,3.0,0.6494634,1.0483567,0.07,,
,2024-07-18 19:44:17,0.036 sec,4.0,0.6438085,1.0323997,0.07,,
,2024-07-18 19:44:17,0.041 sec,5.0,0.6381944,1.0168164,0.07,,
,2024-07-18 19:44:17,0.046 sec,6.0,0.6326216,1.0015935,0.07,,
,2024-07-18 19:44:17,0.051 sec,7.0,0.6270906,0.9867188,0.07,,
,2024-07-18 19:44:17,0.057 sec,8.0,0.6216019,0.9721803,0.07,,
,2024-07-18 19:44:17,0.063 sec,9.0,0.6161560,0.9579668,0.07,,

variable,relative_importance,scaled_importance,percentage
C1,1747.8596191,1.0,0.9804979
C2,34.764946,0.01989,0.0195021


<div style="padding:10px;color:white;margin:0;font-size:200%;text-align:center;display:fill;border-radius:10px;background-color:#215f95;;overflow:hidden;font-weight:501;font-family:magra">Using only h2o.sklearn components</div>

In [55]:
from h2o import H2OFrame
from h2o.sklearn import H2OScaler, H2OPCA, H2OGradientBoostingClassifier
seed = 42

pipeline_h2o = Pipeline([
    ("standardize", H2OScaler()),
    ("pca", H2OPCA(k=2, seed=seed)),
    ("classifier", H2OGradientBoostingClassifier(learn_rate=0.05, seed=seed))
])

### Because this pipeline uses only H2O components, we look at its behaviour first when it is fed `H2OFrame`s and then when it is fed numpy arrays.

In [56]:
X_train_h2o, y_train_h2o = H2OFrame(X_train), H2OFrame(y_train)
X_test_h2o, y_test_h2o = H2OFrame(X_test), H2OFrame(y_test)

Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%


In [57]:
pipeline_h2o.fit(X_train_h2o, y_train_h2o)

pca Model Build progress: |██████████████████████████████████████████████████████| (done) 100%
pca prediction progress: |███████████████████████████████████████████████████████| (done) 100%
gbm Model Build progress: |██████████████████████████████████████████████████████| (done) 100%


In [58]:
preds = pipeline_h2o.predict(X_test_h2o)
assert isinstance(preds, H2OFrame)
pipeline_h2o.score(X_test_h2o, y_test_h2o)

pca prediction progress: |███████████████████████████████████████████████████████| (done) 100%
gbm prediction progress: |███████████████████████████████████████████████████████| (done) 100%
pca prediction progress: |███████████████████████████████████████████████████████| (done) 100%
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
gbm prediction progress: |███████████████████████████████████████████████████████| (done) 100%


0.98

### With numpy arrays

In [59]:
pipeline_h2o.fit(X_train, y_train)

Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
pca Model Build progress: |██████████████████████████████████████████████████████| (done) 100%
pca prediction progress: |███████████████████████████████████████████████████████| (done) 100%
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
gbm Model Build progress: |██████████████████████████████████████████████████████| (done) 100%


In [60]:
preds = pipeline_h2o.predict(X_test)
assert isinstance(preds, H2OFrame)
pipeline_h2o.score(X_test, y_test)

Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
pca prediction progress: |███████████████████████████████████████████████████████| (done) 100%
gbm prediction progress: |███████████████████████████████████████████████████████| (done) 100%
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
pca prediction progress: |███████████████████████████████████████████████████████| (done) 100%
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
gbm prediction progress: |███████████████████████████████████████████████████████| (done) 100%


0.98

Since numpy arrays went in, one might expect `predict` to return a numpy array here, yet `preds` is an `H2OFrame`. This follows from the data format detection logic in the `h2o.sklearn` wrappers. The default rule is straightforward: for a given estimator wrapper, the output type matches the type of the input that wrapper itself receives:

- numpy array in -> numpy array out
- H2OFrame in -> H2OFrame out
- pandas DataFrame in -> numpy array out (with a minor exception, aligning with sklearn's behavior)

Transformer wrappers do not follow this rule by default. Transformers are typically chained with other H2O transformers or estimators, so `transform` leaves its result as an `H2OFrame` to avoid unnecessary conversions within the pipeline. In the pipeline above, the classifier therefore receives an `H2OFrame` from `H2OPCA` and returns an `H2OFrame` as well.

To control the output format of a wrapper, you can use the `data_conversion` parameter, which accepts three possible values:

- `'auto'` (default for estimators): The result matches the input type, as described above.
- `True`: The result is always converted to a numpy array.
- `False` (default for transformers): The result remains unchanged and is returned as an H2OFrame.

For pipelines involving multiple H2O transformers before an estimator, if you prefer to consistently receive numpy arrays, set `data_conversion=True` on the final estimator so that its predictions are always converted.
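
One way to see which adjustment is needed is to trace the rules above step by step. The sketch below is a plain-Python model of the documented behaviour, not H2O's actual implementation: `output_kind` and `pipeline_output_kind` are hypothetical helper names, and the strings `"numpy"`, `"pandas"` and `"h2oframe"` merely stand in for the real container types.

```python
def output_kind(input_kind, data_conversion):
    """Container kind one wrapper hands back, following the rules listed above."""
    if data_conversion is True:
        return "numpy"      # always converted to a numpy array
    if data_conversion is False:
        return "h2oframe"   # result left as an H2OFrame
    if data_conversion == "auto":
        # H2OFrame in -> H2OFrame out; numpy or pandas in -> numpy out
        return "h2oframe" if input_kind == "h2oframe" else "numpy"
    raise ValueError(f"unsupported data_conversion value: {data_conversion!r}")


def pipeline_output_kind(input_kind, step_settings):
    """Feed each step's output kind into the next step and return the final kind."""
    kind = input_kind
    for setting in step_settings:
        kind = output_kind(kind, setting)
    return kind


# Scaler and PCA are transformers (default False); the classifier defaults to 'auto'.
default_steps = [False, False, "auto"]
# Same pipeline, but with data_conversion=True on the final estimator.
converted_steps = [False, False, True]

print(pipeline_output_kind("numpy", default_steps))
print(pipeline_output_kind("numpy", converted_steps))
```

With the default settings the classifier sees an H2OFrame and so returns one, whereas forcing conversion on the last step yields a numpy array. In the real pipeline this would correspond to building the last step as `H2OGradientBoostingClassifier(seed=seed, data_conversion=True)`, assuming the wrapper accepts `data_conversion` as a constructor keyword as described above.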

<a id="1"></a>
# <div style="text-align:center; border-radius:15px 50px; padding:7px; color:white; margin:0; font-size:110%; font-family:Pacifico; background-color:#0f7a7a; overflow:hidden"><b>H2OAutoML specifically for Classification and Regression tasks</b></div>

<div style="padding:10px;color:white;margin:0;font-size:200%;text-align:center;display:fill;border-radius:10px;background-color:#215f95;;overflow:hidden;font-weight:501;font-family:magra">H2OAutoMLClassifier</div>


In [61]:
import warnings
warnings.filterwarnings(action='ignore')

from sklearn import datasets
from sklearn.feature_selection import f_classif, SelectKBest
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import classification_report

from h2o.sklearn import H2OAutoMLClassifier

seed = 2020

X, y = datasets.load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)

pipeline = Pipeline([
    ('polyfeat', PolynomialFeatures(degree=2)),
    ('featselect', SelectKBest(f_classif, k=5)),
    ('classifier', H2OAutoMLClassifier(max_models=10, seed=seed, sort_metric='aucpr'))
])

pipeline.fit(X_train, y_train)
preds = pipeline.predict(X_test)

Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
AutoML progress: |███████████████████████████████████████████████████████████████| (done) 100%
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
gbm prediction progress: |███████████████████████████████████████████████████████| (done) 100%


In [62]:
print(classification_report(y_test, preds))

              precision    recall  f1-score   support

           0       0.95      0.84      0.89        49
           1       0.92      0.98      0.95        94

    accuracy                           0.93       143
   macro avg       0.94      0.91      0.92       143
weighted avg       0.93      0.93      0.93       143



In [63]:
automl = pipeline.named_steps.classifier.estimator
automl.leaderboard

model_id,aucpr,auc,logloss,mean_per_class_error,rmse,mse
GBM_1_AutoML_2_20240718_194422,0.988367,0.983939,0.140175,0.0482167,0.192652,0.0371146
GBM_4_AutoML_2_20240718_194422,0.984707,0.980615,0.150177,0.0536168,0.200661,0.040265
GBM_3_AutoML_2_20240718_194422,0.9844,0.979589,0.158314,0.0566843,0.205944,0.0424128
GLM_1_AutoML_2_20240718_194422,0.983338,0.982691,0.135985,0.0413469,0.187043,0.0349851
GBM_2_AutoML_2_20240718_194422,0.982512,0.979146,0.154192,0.0547832,0.203697,0.0414925
StackedEnsemble_BestOfFamily_1_AutoML_2_20240718_194422,0.978455,0.980289,0.138337,0.0505493,0.190778,0.0363964
XGBoost_3_AutoML_2_20240718_194422,0.97635,0.978831,0.144529,0.0498146,0.196012,0.0384209
StackedEnsemble_AllModels_1_AutoML_2_20240718_194422,0.976227,0.979367,0.143074,0.0547832,0.194034,0.0376491
XGBoost_2_AutoML_2_20240718_194422,0.975378,0.977816,0.148309,0.0486482,0.193246,0.0373441
XRT_1_AutoML_2_20240718_194422,0.971628,0.973011,0.435327,0.0524505,0.20372,0.0415017


<div style="padding:10px;color:white;margin:0;font-size:200%;text-align:center;display:fill;border-radius:10px;background-color:#215f95;;overflow:hidden;font-weight:501;font-family:magra">H2OAutoMLRegressor</div>

In [64]:
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import train_test_split, GridSearchCV
from h2o.sklearn import H2OAutoMLRegressor

ds = datasets.fetch_california_housing()
seed = 2020
regressor = H2OAutoMLRegressor(max_models=10, max_runtime_secs_per_model=30, seed=seed)
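# NOTE: AGE, PTRATIO and CRIM are Boston-housing column names. The California
# housing features are MedInc, HouseAge, AveRooms, AveBedrms, Population,
# AveOccup, Latitude and Longitude, so swap in one of those if the constraint
# is meant to target an actual column of this dataset.
grid = GridSearchCV(regressor, cv=2, param_grid=dict(
    monotone_constraints=[None, dict(AGE=1), dict(PTRATIO=1), dict(CRIM=-1)],
))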

X = pd.DataFrame(ds.data, columns=ds.feature_names)
y = ds.target
grid.fit(X, y)

Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
AutoML progress: |███████████████████████████████████████████████████████████████| (done) 100%
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
stackedensemble prediction progress: |███████████████████████████████████████████| (done) 100%
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
AutoML progress: |███████████████████████████████████████████████████████████████| (done) 100%
Parse progress: |█████████████████████████████████

In [65]:
best = grid.best_estimator_
grid.best_params_

{'monotone_constraints': {'PTRATIO': 1}}

In [66]:
grid.cv_results_

{'mean_fit_time': array([ 67.81097543, 134.56753922, 136.65167022, 132.85281551]),
 'std_fit_time': array([1.17112148, 3.5891149 , 1.76085472, 0.63982546]),
 'mean_score_time': array([2.22363043, 1.63450325, 1.5525403 , 1.62317908]),
 'std_score_time': array([0.34312129, 0.08246863, 0.00647855, 0.0957495 ]),
 'param_monotone_constraints': masked_array(data=[None, {'AGE': 1}, {'PTRATIO': 1}, {'CRIM': -1}],
              mask=[False, False, False, False],
        fill_value='?',
             dtype=object),
 'params': [{'monotone_constraints': None},
  {'monotone_constraints': {'AGE': 1}},
  {'monotone_constraints': {'PTRATIO': 1}},
  {'monotone_constraints': {'CRIM': -1}}],
 'split0_test_score': array([0.61674607, 0.61507874, 0.61436981, 0.61606198]),
 'split1_test_score': array([0.59462251, 0.59710319, 0.59843931, 0.59542936]),
 'mean_test_score': array([0.60568429, 0.60609097, 0.60640456, 0.60574567]),
 'std_test_score': array([0.01106178, 0.00898777, 0.00796525, 0.01031631]),
 'rank_t
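
The `mean_test_score` and `std_test_score` entries above are the mean and population standard deviation of the per-split scores, and `best_params_` is the candidate with the highest mean. A quick standard-library check using the numbers copied from the output (the variable names are illustrative). Since `score` on a regressor wrapper uses `r2_score`, these values are presumably R² scores:

```python
from statistics import mean, pstdev

# Candidate labels and per-split scores copied from grid.cv_results_ above.
candidates = ["None", "AGE=1", "PTRATIO=1", "CRIM=-1"]
split0 = [0.61674607, 0.61507874, 0.61436981, 0.61606198]
split1 = [0.59462251, 0.59710319, 0.59843931, 0.59542936]

# One (split0, split1) pair per candidate.
mean_scores = [mean(pair) for pair in zip(split0, split1)]
std_scores = [pstdev(pair) for pair in zip(split0, split1)]

# The best candidate is the one with the highest mean test score.
best_index = max(range(len(candidates)), key=lambda i: mean_scores[i])
spread = max(mean_scores) - min(mean_scores)

print("best candidate:", candidates[best_index])
print(f"spread of mean scores: {spread:.5f}")
print(f"smallest std across splits: {min(std_scores):.5f}")
```

The gap between the best and worst candidate (about 0.0007) is roughly a tenth of the smallest split-to-split standard deviation (about 0.008), so the ranking of the four candidates is well within the noise of a 2-fold split.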

In [67]:
automl = best.estimator
automl.leaderboard

model_id,rmse,mse,mae,rmsle,mean_residual_deviance
StackedEnsemble_BestOfFamily_1_AutoML_11_20240718_200037,0.479862,0.230268,0.315381,0.142553,0.230268
StackedEnsemble_AllModels_1_AutoML_11_20240718_200037,0.479975,0.230376,0.315479,0.142582,0.230376
DRF_1_AutoML_11_20240718_200037,0.490061,0.24016,0.328589,0.146337,0.24016
XRT_1_AutoML_11_20240718_200037,0.490908,0.240991,0.328876,0.146649,0.240991
DeepLearning_1_AutoML_11_20240718_200037,0.628485,0.394994,0.443833,,0.394994
DeepLearning_grid_3_AutoML_11_20240718_200037_model_1,0.710625,0.504987,0.534912,,0.504987
GLM_1_AutoML_11_20240718_200037,0.725083,0.525745,0.531309,,0.525745
DeepLearning_grid_2_AutoML_11_20240718_200037_model_1,0.744375,0.554094,0.563558,,0.554094
DeepLearning_grid_1_AutoML_11_20240718_200037_model_1,0.886278,0.785488,0.518116,,0.785488
DeepLearning_grid_1_AutoML_11_20240718_200037_model_2,1.49957,2.24872,0.530797,,2.24872


A few references I used:
https://github.com/h2oai/h2o-tutorials/tree/master

### The rest stays the same as described earlier: how to fetch the best model, how to read its parameters and how to train all work as before.
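
As a reminder of what that looks like with the wrappers: the underlying `H2OAutoML` object is reachable through the wrapper's `estimator` property, and its best model through `leader`. The helper below is a small sketch (the name `leader_settings` is made up for this example). It assumes `leader.params` maps each parameter name to a dict with `default`, `actual` and `input` keys, like the parameter table shown earlier. It is exercised here with a stand-in object holding made-up values so that it runs without an H2O backend.

```python
from types import SimpleNamespace


def leader_settings(automl):
    """Return the leader's model id and the parameters that differ from their defaults."""
    leader = automl.leader
    changed = {
        name: values["actual"]
        for name, values in leader.params.items()
        if values["actual"] != values["default"]
    }
    return leader.model_id, changed


# Stand-in for a fitted H2OAutoML object; all values here are illustrative only.
fake_automl = SimpleNamespace(
    leader=SimpleNamespace(
        model_id="example_model",
        params={
            "epochs": {"default": 10.0, "actual": 10000.0, "input": 10000.0},
            "rho": {"default": 0.99, "actual": 0.9, "input": 0.9},
            "l1": {"default": 0.0, "actual": 0.0, "input": 0.0},
        },
    )
)

model_id, changed = leader_settings(fake_automl)
print(model_id, changed)
```

With a fitted wrapper, the same helper would be called as `leader_settings(pipeline.named_steps.classifier.estimator)` or `leader_settings(grid.best_estimator_.estimator)`.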

<a id="1"></a>
# <div style="text-align:center; border-radius:15px 50px; padding:7px; color:white; margin:0; font-size:110%; font-family:Pacifico; background-color:#0f7a7a; overflow:hidden"><b>Auto EDA</b></div>

In [68]:
!pip install ydata-profiling
!pip install dataprep
!pip install dabl
!pip install autoviz
!pip install sweetviz

Collecting numpy<1.26,>=1.16.0 (from ydata-profiling)
  Downloading numpy-1.25.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (5.6 kB)
Downloading numpy-1.25.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (18.2 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 18.2/18.2 MB 35.8 MB/s eta 0:00:00
Installing collected packages: numpy
  Attempting uninstall: numpy
    Found existing installation: numpy 1.26.4
    Uninstalling numpy-1.26.4:
      Successfully uninstalled numpy-1.26.4
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
cudf 24.4.1 requires cubinlinker, which is not installed.
cudf 24.4.1 requires cupy-cuda11x>=12.0.0, which is not installed.
cudf 24.4.1 requires ptxcompiler, which is not installed.
cuml 24.4.0 requires cupy-cuda11x>=12.0.0, which is no

In [69]:
import pandas as pd
df = pd.read_csv("/kaggle/input/pima-indians-diabetes-database/diabetes.csv")
df.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.63,50,1
1,1,85,66,29,0,26.6,0.35,31,0
2,8,183,64,0,0,23.3,0.67,32,1
3,1,89,66,23,94,28.1,0.17,21,0
4,0,137,40,35,168,43.1,2.29,33,1


<div style="padding:10px;color:white;margin:0;font-size:200%;text-align:center;display:fill;border-radius:10px;background-color:#215f95;;overflow:hidden;font-weight:501;font-family:magra">Pandas Profiling</div>

In [None]:
from ydata_profiling import ProfileReport

profile = ProfileReport(df, title="Indian Diabetes Report")
profile.to_file("Indian_Diabetes_report.html")

profile.to_notebook_iframe()

<div style="padding:10px;color:white;margin:0;font-size:200%;text-align:center;display:fill;border-radius:10px;background-color:#215f95;;overflow:hidden;font-weight:501;font-family:magra">SweetViz</div>

In [73]:
import sweetviz as sv

report = sv.analyze(df)
report.show_html("sweetviz_Indian_Diabetes_report.html")
report.show_notebook()

                                             |          | [  0%]   00:00 -> (? left)

Report sweetviz_Indian_Diabetes_report.html was generated! NOTEBOOK/COLAB USERS: the web browser MAY not pop up, regardless, the report IS saved in your notebook/colab files.


##### If you found this useful, please show your support. If you spot any flaws, or know how to install libraries such as AutoSklearn, let me know in the comments; I am still looking for a solution :)