# AutoML Benchmark - Titanic Disaster

This notebook presents several AutoML (automated machine learning) packages for classification tasks. We are going to run an AutoML benchmark on [Titanic - Machine Learning from Disaster](https://www.kaggle.com/competitions/titanic). By the end, I hope you will be able to apply these solutions in your own projects.   

> **Summary** - AutoML classification on real data.   
> Content aimed at an intermediate level in Machine Learning and Data Science!   

<a id="ToC"></a>
## Table of Contents
- [Data Exploration](#data)
    - Label Encoder
    - Data Imputer
- [AutoML](#automl)
    - [Lazy Predict](#automl-lazypredict)
    - [hyperopt-sklearn](#automl-hyperopt)
    - [auto-sklearn](#automl-sklearn)
    - [TPOT](#automl-tpot)
    - [MLJAR](#automl-mljar)
    - [FLAML](#automl-flaml)
    - [AutoGluon](#automl-autogluon)
    - [H2O](#automl-h2o)
    - [AutoKeras](#automl-autokeras)
    - [MLBox](#automl-mlbox)
    - [PyCaret](#automl-pycaret)



In [1]:
%%capture
# Install main packages
!pip install numpy==1.11.0 scikit-learn==1.0.2
!pip install matplotlib==3.5.3 seaborn==0.11.2

In [2]:
# Ignore warnings
import warnings
warnings.filterwarnings("ignore")

# Restart kernel, for new downloaded packages
# -- it didn't work in Kaggle
# https://stackoverflow.com/questions/37751120/restart-ipython-kernel-with-a-command-from-a-cell

<a id="data"></a>

---
# Data Exploration

We are going to encode the categorical features (using an `Encoder`) and impute missing data (using an `Imputer`).

[> Back to Table of Contents](#ToC)

In [3]:
# Main imports
import numpy as np
import pandas as pd

import seaborn as sns
import matplotlib.pyplot as plt

In [4]:
# List the files
import os

for dirname, _, filenames in os.walk("/kaggle/input"):
    for filename in filenames:
        print(os.path.join(dirname, filename))

/kaggle/input/titanic/train.csv
/kaggle/input/titanic/test.csv
/kaggle/input/titanic/gender_submission.csv


In [5]:
# Train set
train = pd.read_csv("/kaggle/input/titanic/train.csv")
train.sort_values("PassengerId", inplace=True)
print("train", train.shape)
display(train.head(3))

# Test set
test = pd.read_csv("/kaggle/input/titanic/test.csv")
test.sort_values("PassengerId", inplace=True)
print("test", test.shape)
display(test.head(3))

# Submission
submission = pd.read_csv("/kaggle/input/titanic/gender_submission.csv")
submission.sort_values("PassengerId", inplace=True)
print("submission", submission.shape)
display(submission.head(3))

train (891, 12)


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S


test (418, 11)


Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q


submission (418, 2)


Unnamed: 0,PassengerId,Survived
0,892,0
1,893,1
2,894,0


## Label Encoder

Many algorithms cannot handle `str` features, so we are going to encode them as `int`/`float` features.

In [6]:
# Check train set
train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 90.5+ KB


In [7]:
# Label encoder
from sklearn.preprocessing import LabelEncoder

# Encoding
pre_columns = ["Sex", "Embarked"]
encoders = {}
for c in pre_columns:
    encoder = LabelEncoder()
    train[c] = encoder.fit_transform(train[c].astype("str"))
    encoders[c] = encoder
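The fit-on-train, transform-on-test pattern used above can be sketched on hypothetical toy values (not the Titanic columns). `LabelEncoder` assigns integer codes following the sorted order of the classes seen during `fit`, and `transform` raises a `ValueError` for a label it has never seen, which matters when the test set may contain new categories.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy column standing in for "Sex"
toy_train = ["male", "female", "male", "female"]
toy_test = ["female", "male"]

toy_encoder = LabelEncoder()
train_codes = toy_encoder.fit_transform(toy_train)  # learn the classes on train
test_codes = toy_encoder.transform(toy_test)        # reuse the same mapping on test

print(list(toy_encoder.classes_))  # classes are stored in sorted order
print(train_codes.tolist(), test_codes.tolist())

# A label that was never seen during fit cannot be encoded
try:
    toy_encoder.transform(["unknown"])
    unseen_ok = True
except ValueError:
    unseen_ok = False
print("unseen label accepted:", unseen_ok)
```

Keeping the fitted encoders in a dictionary, as the cell above does, is what makes it possible to apply the same mapping to the test set later.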

## Impute Missing Values

A few columns have missing data, such as **Age** and **Cabin**. We are going to use an `Imputer` to fill in the empty fields of the feature columns used for modeling; **Cabin** is not one of them, so it is left as is.

In [8]:
# Check train set
train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    int64  
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     891 non-null    int64  
dtypes: float64(2), int64(7), object(3)
memory usage: 90.5+ KB


In [9]:
# Data imputer
from sklearn.impute import KNNImputer

# Reference columns
x_columns = ["Pclass", "Sex", "SibSp", "Parch", "Fare", "Age", "Embarked"]
imputer = KNNImputer(n_neighbors=3, weights="uniform")
train[x_columns] = imputer.fit_transform(train[x_columns])
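As a rough illustration of what `KNNImputer` does, the sketch below fills missing entries in a hypothetical toy matrix (not the Titanic data) with the mean of the nearest rows that have the feature present. It also shows the usual split of responsibilities: `fit_transform` on the training data, then `transform` on new data so the neighbors still come from the training set.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical toy matrix with missing entries (np.nan)
toy_train = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])
toy_test = np.array([[np.nan, 4.0, 3.0]])

toy_imputer = KNNImputer(n_neighbors=2, weights="uniform")
train_filled = toy_imputer.fit_transform(toy_train)  # fit on train only
test_filled = toy_imputer.transform(toy_test)        # neighbors come from train

print(train_filled)
print(test_filled)
```

Note that the imputer returns `float` values for every column, which is why the encoded integer columns show up as `float64` afterwards.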

In [10]:
display(train[x_columns].info())
display(train[x_columns].sample(5))

<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 0 to 890
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Pclass    891 non-null    float64
 1   Sex       891 non-null    float64
 2   SibSp     891 non-null    float64
 3   Parch     891 non-null    float64
 4   Fare      891 non-null    float64
 5   Age       891 non-null    float64
 6   Embarked  891 non-null    float64
dtypes: float64(7)
memory usage: 55.7 KB


None

Unnamed: 0,Pclass,Sex,SibSp,Parch,Fare,Age,Embarked
432,2.0,0.0,1.0,0.0,26.0,42.0,2.0
686,3.0,1.0,4.0,1.0,39.6875,14.0,2.0
136,1.0,0.0,0.0,2.0,26.2833,19.0,2.0
834,3.0,1.0,0.0,0.0,8.3,18.0,2.0
552,3.0,1.0,0.0,0.0,7.8292,33.833333,1.0


In [11]:
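def preprocessing(X: pd.DataFrame) -> pd.DataFrame:
    """Preprocessing: apply the fitted label encoders and imputer"""
    # Work on a copy so the caller's DataFrame is not modified in place
    X = X.copy()

    # Label encoder: reuse the encoders fitted on the train set
    for c in pre_columns:
        X[c] = encoders[c].transform(X[c].astype("str"))
    # Imputer: transform only (no refit), so the neighbors come from the train set
    X[x_columns] = imputer.transform(X[x_columns])
    return X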

In [12]:
# Process test set, using the trained parameters
test = preprocessing(test)
test[x_columns]

Unnamed: 0,Pclass,Sex,SibSp,Parch,Fare,Age,Embarked
0,3.0,1.0,0.0,0.0,7.8292,34.500000,1.0
1,3.0,0.0,1.0,0.0,7.0000,47.000000,2.0
2,2.0,1.0,0.0,0.0,9.6875,62.000000,1.0
3,3.0,1.0,0.0,0.0,8.6625,27.000000,2.0
4,3.0,0.0,1.0,1.0,12.2875,22.000000,2.0
...,...,...,...,...,...,...,...
413,3.0,1.0,0.0,0.0,8.0500,29.666667,2.0
414,1.0,0.0,0.0,0.0,108.9000,39.000000,0.0
415,3.0,1.0,0.0,0.0,7.2500,38.500000,2.0
416,3.0,1.0,0.0,0.0,8.0500,29.666667,2.0


<a href="#ToC"><span class="label label-info" style="font-size: 125%">> Back to Table of Contents</span></a>

<a id="automl"></a>

---
# AutoML

Automated machine learning (AutoML) is the process of automating the tasks of applying machine learning to real-world problems. AutoML potentially includes every stage from beginning with a raw dataset to building a machine learning model ready for deployment. [Wikipedia](https://en.wikipedia.org/wiki/Automated_machine_learning)

[> Back to Table of Contents](#ToC)

In [13]:
# Store experiments results
experiment = {}
best_model_name = None

In [14]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def evaluate_model(model_name, y_test, y_pred, focus_metric="F1"):
    '''Evaluate a model'''
    global experiment, best_model_name

    acc = accuracy_score(y_test, y_pred)
    pre = precision_score(y_test, y_pred)
    rec = recall_score(y_test, y_pred)
    f1  = f1_score(y_test, y_pred)    
    experiment[model_name] = {"Acc":acc, "Pre":pre, "Rec":rec, "F1":f1}

    print(f"Accuracy : {acc:.4f}")
    print(f"Precision: {pre:.4f}")
    print(f"Recall   : {rec:.4f}")
    print(f"F1-score : {f1:.4f}")

    if not isinstance(best_model_name, tuple) or best_model_name[1] < experiment[model_name][focus_metric]:
        best_model_name = (model_name, experiment[model_name][focus_metric])
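To make the four metrics reported by `evaluate_model` concrete, here is a small worked example that computes them by hand from true/false positive and negative counts. The labels and predictions are hypothetical toy values, not Titanic results.

```python
# Hypothetical labels and predictions (1 = survived, 0 = did not survive)
toy_true = [1, 1, 1, 0, 0, 0, 1, 0]
toy_pred = [1, 0, 1, 0, 1, 0, 1, 1]

pairs = list(zip(toy_true, toy_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # true positives
fp = sum(t == 0 and p == 1 for t, p in pairs)  # false positives
fn = sum(t == 1 and p == 0 for t, p in pairs)  # false negatives
tn = sum(t == 0 and p == 0 for t, p in pairs)  # true negatives

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy : {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall   : {recall:.4f}")
print(f"F1-score : {f1:.4f}")
```

Because F1 is the harmonic mean of precision and recall, it drops when either one is low, which makes it a reasonable default `focus_metric` for a dataset with unequal class sizes.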

In [15]:
def create_submission(predict, X_test, submission, filename, pred_column="Survived"):
    '''Generate a submission file'''
    preds = predict(X_test)
    submission[pred_column] = preds
    submission[pred_column] = submission[pred_column].astype('int32')
    submission.to_csv(filename, index=False)

## Dataset

In [16]:
# Create X_train, y_train, and X_test
seed_number = 28
label_column = "Survived"
X_train, y_train = train[x_columns], train[label_column]
X_test = test[x_columns]

In [17]:
# Save for fast recovery (when restart the notebook)
np.savetxt('X_train.np', X_train, delimiter=';')
np.savetxt('y_train.np', y_train, delimiter=';')
np.savetxt('X_test.np', X_test, delimiter=';')

<a href="#ToC"><span class="label label-info" style="font-size: 125%">Back to Table of Contents</span></a>

<a id="automl-lazypredict"></a>

---
# Lazy Predict

- [GitHub](https://github.com/shankarpandala/lazypredict)
- [Documentation](https://lazypredict.readthedocs.io/)

**Lazy Predict** helps build a lot of basic models (from scikit-learn) without much code and helps understand which models work better without any parameter tuning.
In other words, Lazy Predict is a good library for quickly testing multiple solutions at once. It only includes preprocessing and model training, so we cannot perform any fine-tuning.

[> Back to Table of Contents](#ToC)

In [18]:
%%capture
# install Lazy Predict
!pip install lazypredict==0.2.12

In [19]:
%%time
from lazypredict.Supervised import LazyClassifier

automl = LazyClassifier(verbose=0, ignore_warnings=True, custom_metric=None)
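# NOTE: fit() expects (X_train, X_test, y_train, y_test). No labeled hold-out set is
# used here, so the train set is passed twice and the scores below are train-set scores.
results, _ = automl.fit(X_train, X_train, y_train, y_train)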

100%|██████████| 29/29 [00:02<00:00, 10.70it/s]

CPU times: user 6.25 s, sys: 921 ms, total: 7.17 s
Wall time: 4.91 s





In [20]:
# Experiments
display(results)

Model,Accuracy,Balanced Accuracy,ROC AUC,F1 Score,Time Taken
RandomForestClassifier,0.98,0.98,0.98,0.98,0.26
DecisionTreeClassifier,0.98,0.98,0.98,0.98,0.02
ExtraTreeClassifier,0.98,0.98,0.98,0.98,0.02
ExtraTreesClassifier,0.98,0.98,0.98,0.98,0.26
BaggingClassifier,0.97,0.96,0.96,0.97,0.06
XGBClassifier,0.97,0.96,0.96,0.97,0.57
LGBMClassifier,0.95,0.94,0.94,0.95,0.27
LabelPropagation,0.92,0.9,0.9,0.92,0.11
LabelSpreading,0.91,0.9,0.9,0.91,0.14
KNeighborsClassifier,0.87,0.86,0.86,0.87,0.05


In [21]:
# Get best model
cls_name = results.iloc[0].name
model = automl.models[cls_name]
# display(model)

In [22]:
# Evaluation
y_pred = model.predict(X_train)
evaluate_model("lazypredict", y_train, y_pred)

Accuracy : 0.9820
Precision: 0.9880
Recall   : 0.9649
F1-score : 0.9763


In [23]:
# Submission
create_submission(model.predict, X_test, submission, "submission-lazypredict.csv")

In [24]:
# Results
pd.DataFrame(experiment).T.style.highlight_max(axis=0)

Unnamed: 0,Acc,F1,Pre,Rec
lazypredict,0.98,0.98,0.99,0.96


### Discussion

Notes
* **Good** - It is simple to install and use
* **Good** - It presents nice results
* **Simple** - It runs the scikit-learn models

Features
* **Good** - It includes preprocessing and model training (and nothing else)
* **Bad** - It does not include fine-tuning

Public Score - 0.75598
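The gap between the training-set scores above (about 0.98) and the public score (0.75598) is what one would expect when a model is evaluated on the same rows it was trained on. A hold-out split gives a more honest estimate. The sketch below uses synthetic data from `make_classification` as a stand-in for the Titanic features and a plain scikit-learn decision tree; both are assumptions chosen for illustration, not part of the benchmark.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the Titanic features (891 rows, 7 columns)
X_demo, y_demo = make_classification(
    n_samples=891, n_features=7, n_informative=4, flip_y=0.1, random_state=28
)

# Keep 20% of the rows aside; the model never sees them during fit
X_fit, X_val, y_fit, y_val = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=28
)

tree = DecisionTreeClassifier(random_state=28).fit(X_fit, y_fit)
train_acc = accuracy_score(y_fit, tree.predict(X_fit))
val_acc = accuracy_score(y_val, tree.predict(X_val))

print(f"train accuracy     : {train_acc:.4f}")
print(f"validation accuracy: {val_acc:.4f}")
```

With Lazy Predict, the same idea would amount to passing the held-out part as the second and fourth arguments of `fit`.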

<a id="automl-hyperopt"></a>

---
# hyperopt-sklearn

- [GitHub](https://github.com/hyperopt/hyperopt-sklearn)
- [Documentation](http://hyperopt.github.io/hyperopt-sklearn/)

**hyperopt-sklearn** (hyperparameter optimization for sklearn) is [hyperopt](https://github.com/hyperopt/hyperopt)-based model selection among machine learning algorithms in scikit-learn.
In contrast with **Lazy Predict**, hyperopt-sklearn can perform hyperparameter tuning of the models.

[> Back to Table of Contents](#ToC)


_This code is shown as markdown rather than as executable cells, because its dependencies are incompatible with the packages used in this Jupyter Notebook._

**Installation**

```python
%%capture
# install hyperopt-sklearn
!pip install git+https://github.com/hyperopt/hyperopt-sklearn
```

**Read data**

```python
# Read data
import numpy as np
import pandas as pd

X_train = np.loadtxt('X_train.np', delimiter=';')
y_train = np.loadtxt('y_train.np', delimiter=';')
X_test = np.loadtxt('X_test.np', delimiter=';')
```

**AutoML**

```python
%%time
from hpsklearn import HyperoptEstimator

automl = HyperoptEstimator(max_evals = 30, verbose=False)
automl.fit(X_train, y_train)
```

**Evaluation**

```python
# Evaluation
y_pred = automl.predict(X_train)
evaluate_model("hyperopt-sklearn", y_train, y_pred)
```

```sh
Accuracy : 0.8631
Precision: 0.8216
Recall   : 0.8216
F1-score : 0.8216
```

**Submission**

```python
# Submission
create_submission(automl.predict, X_test, submission, "submission-hyperopt-sklearn.csv")
```

```python
# Results
pd.DataFrame(experiment).T.style.highlight_max(axis=0)
```

### Discussion

Notes
* **Good** - It is simple to use
* **Good** - It presents nice results
* **Simple** - It runs the scikit-learn models
* **Bad** - It is complicated to install and pulls in a few incompatible packages

Features
* **Good** - It includes model training and fine-tuning
* **Bad** - It does not include preprocessing

Public Score - 0.78229
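For readers who cannot install hyperopt-sklearn, the idea of a budgeted hyperparameter search can be sketched with scikit-learn alone. Note that this is a swapped-in technique: `RandomizedSearchCV` samples configurations at random, whereas hyperopt uses its own search algorithms and hyperopt-sklearn also searches across estimators and preprocessors. The synthetic data and the search space below are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the Titanic features)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=7, n_informative=4, random_state=28
)

# Hypothetical search space for a single estimator
search_space = {
    "max_depth": [2, 3, 4, 5, 6, None],
    "min_samples_leaf": [1, 2, 5, 10],
}

search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=28),
    param_distributions=search_space,
    n_iter=10,        # evaluation budget, loosely analogous to max_evals
    scoring="f1",
    cv=3,             # each configuration is scored by 3-fold cross-validation
    random_state=28,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(f"best cross-validated F1: {search.best_score_:.4f}")
```

The cross-validated score reported here is computed on folds the model did not train on, so it is not directly comparable to the training-set metrics printed by `evaluate_model`.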

<a id="automl-sklearn"></a>

---
# auto-sklearn

- [GitHub](https://github.com/automl/auto-sklearn)
- [Documentation](https://automl.github.io/auto-sklearn/master/)

**auto-sklearn** is an automated machine learning toolkit and a drop-in replacement for a scikit-learn estimator. In summary, it combines data preprocessing, feature preprocessing and classifier evaluation. Note that it does not train only simple models: auto-sklearn ensembles models to get better performance. _Note that it requires an older version of scikit-learn_.

[> Back to Table of Contents](#ToC)

_This code is shown as markdown rather than as executable cells, because its dependencies are incompatible with the packages used in this Jupyter Notebook._

**Installation**

```python
# %%capture
# install dependencies
!apt-get -y remove swig
!apt-get -y install swig3.0 build-essential
!ln -s /usr/bin/swig3.0 /usr/bin/swig
# install auto-sklearn
!pip install scikit-learn==0.24.2
!pip install git+https://github.com/automl/auto-sklearn
```

**Read data**

```python
# Read data
import numpy as np
import pandas as pd

X_train = np.loadtxt('X_train.np', delimiter=';')
y_train = np.loadtxt('y_train.np', delimiter=';')
X_test = np.loadtxt('X_test.np', delimiter=';')
```

**AutoML**

```python
%%time
from autosklearn.classification import AutoSklearnClassifier

automl = AutoSklearnClassifier(time_left_for_this_task=150, ensemble_kwargs = {'ensemble_size': 5}, seed=28)
automl.fit(X_train, y_train)
```

**Evaluation**

```python
# Evaluation
y_pred = automl.predict(X_train)
evaluate_model("auto-sklearn", y_train, y_pred)
```

```sh
Accuracy : 0.8698
Precision: 0.8576
Recall   : 0.7924
F1-score : 0.8237
```

**Submission**

```python
create_submission(automl.predict, X_test, submission, "submission-auto-sklearn.csv")
```


### Discussion

Notes
* **Good** - It is simple to use
* **Good** - It presents nice results
* **Simple** - It runs the scikit-learn models
* **Bad** - It requires an outdated version of scikit-learn

Features
* **Good** - It includes preprocessing, model training and ensembling
* **Bad** - It generates complicated ensemble models (combining over 10 models into a meta-model)

Public Score - 0.74401
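The ensembling idea mentioned above can be illustrated with a weighted average of predicted probabilities. This is only a sketch of the general technique (weighted soft voting), not auto-sklearn's actual ensemble-selection procedure; the three models, their probabilities and the weights are all hypothetical.

```python
# Predicted probability of class 1 from three hypothetical models, for four samples
model_probs = {
    "model_a": [0.9, 0.4, 0.2, 0.6],
    "model_b": [0.8, 0.6, 0.1, 0.3],
    "model_c": [0.7, 0.7, 0.4, 0.4],
}
# Hypothetical ensemble weights (they sum to 1)
weights = {"model_a": 0.5, "model_b": 0.3, "model_c": 0.2}

n_samples = len(model_probs["model_a"])
ensemble_probs = [
    sum(weights[name] * probs[i] for name, probs in model_probs.items())
    for i in range(n_samples)
]
# Threshold the averaged probability to get the final class
ensemble_preds = [int(p >= 0.5) for p in ensemble_probs]

print([round(p, 2) for p in ensemble_probs])
print(ensemble_preds)
```

The second sample shows why such a meta-model is harder to interpret: `model_a` alone would predict class 0, yet the ensemble predicts class 1.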

<a id="automl-tpot"></a>

---
# TPOT

- [GitHub](https://github.com/EpistasisLab/tpot)
- [Documentation](http://epistasislab.github.io/tpot/)

**TPOT** stands for Tree-based Pipeline Optimization Tool. It is an AutoML tool that optimizes machine learning pipelines using genetic programming.
It looks like a combination of **hyperopt-sklearn** fine-tuning and **auto-sklearn** data preprocessing; however, it does not ensemble the models, which keeps them simple and interpretable.

[> Back to Table of Contents](#ToC)

In [25]:
%%capture
# install TPOT
!pip install deap update_checker tqdm stopit xgboost
!pip install scikit-mdr skrebate
!pip install tpot==0.11.7

In [26]:
%%time
from tpot import TPOTClassifier

automl = TPOTClassifier(generations=5, population_size=20, verbosity=0, random_state=28)
automl.fit(X_train, y_train)

CPU times: user 3min 36s, sys: 44.9 s, total: 4min 21s
Wall time: 3min


TPOTClassifier(generations=5, population_size=20, random_state=28)

In [27]:
# Evaluation
y_pred = automl.predict(X_train)
evaluate_model("tpot", y_train, y_pred)

Accuracy : 0.8833
Precision: 0.8864
Recall   : 0.7982
F1-score : 0.8400


In [28]:
# Submission
create_submission(automl.predict, X_test, submission, "submission-tpot.csv")

In [29]:
# Results
pd.DataFrame(experiment).T.style.highlight_max(axis=0)

Unnamed: 0,Acc,Pre,Rec,F1
lazypredict,0.98,0.99,0.96,0.98
tpot,0.88,0.89,0.8,0.84


### Discussion

Notes
* **Good** - It is simple to install and use
* **Good** - It presents nice results
* **Simple** - It runs the scikit-learn models

Features
* **Good** - It includes preprocessing, model training and fine-tuning

Public Score - 0.76555 (good)
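To give an intuition for what `generations` and `population_size` control, here is a toy evolutionary search in plain Python. It is a deliberately simplified sketch (selection plus mutation over two numbers), not TPOT's algorithm: TPOT evolves whole pipeline trees and scores them by cross-validation. The `fitness` function and its optimum are hypothetical.

```python
import random

def fitness(candidate):
    """Hypothetical objective with its peak at depth=6, learning_rate=0.1."""
    depth, lr = candidate
    return -((depth - 6) ** 2) - 100 * (lr - 0.1) ** 2

def mutate(candidate, rng):
    """Return a slightly perturbed copy of a candidate."""
    depth, lr = candidate
    depth = max(1, depth + rng.choice([-1, 0, 1]))
    lr = min(1.0, max(0.01, lr + rng.uniform(-0.05, 0.05)))
    return (depth, lr)

def evolve(generations=5, population_size=20, seed=28):
    rng = random.Random(seed)
    population = [
        (rng.randint(1, 12), rng.uniform(0.01, 1.0)) for _ in range(population_size)
    ]
    history = []
    for _ in range(generations):
        # Selection: keep the better half unchanged
        population.sort(key=fitness, reverse=True)
        survivors = population[: population_size // 2]
        # Variation: refill the population with mutated copies of survivors
        children = [
            mutate(rng.choice(survivors), rng)
            for _ in range(population_size - len(survivors))
        ]
        population = survivors + children
        history.append(max(fitness(c) for c in population))
    return max(population, key=fitness), history

best, history = evolve()
print("best candidate:", best)
print("best fitness per generation:", [round(h, 3) for h in history])
```

Because the best candidates always survive, the best fitness can never decrease from one generation to the next; more generations or a larger population mean more evaluations and a longer run time.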

<a id="automl-mljar"></a>

---
# MLJAR

- [GitHub](https://github.com/mljar/mljar-supervised)
- [Documentation](https://supervised.mljar.com/)

**mljar-supervised** is an AutoML package that works with tabular data. It abstracts the common way to preprocess the data, construct the machine learning models, and perform hyperparameter tuning to find the best model. It also supports explainability and automatic exploratory data analysis. It contains more features than **TPOT**, but it is more complex too.

[> Back to Table of Contents](#ToC)

In [30]:
%%capture
# install MLJar
!pip install mljar-supervised==0.11.3

In [31]:
%%time
from supervised.automl import AutoML

automl = AutoML(mode="Compete", total_time_limit=1*60)
automl.fit(X_train, y_train)

AutoML directory: AutoML_1
The task is binary_classification with evaluation metric logloss
AutoML will use algorithms: ['Decision Tree', 'Linear', 'Random Forest', 'Extra Trees', 'LightGBM', 'Xgboost', 'CatBoost', 'Neural Network', 'Nearest Neighbors']
AutoML will stack models
AutoML will ensemble available models
AutoML steps: ['adjust_validation', 'simple_algorithms', 'default_algorithms', 'not_so_random', 'golden_features', 'kmeans_features', 'insert_random_feature', 'features_selection', 'hill_climbing_1', 'hill_climbing_2', 'boost_on_errors', 'ensemble', 'stack', 'ensemble_stacked']
* Step adjust_validation will try to check up to 1 model
1_DecisionTree logloss 0.397652 trained in 0.81 seconds
Disable stacking for split validation
* Step simple_algorithms will try to check up to 3 models
2_DecisionTree logloss 0.625735 trained in 1.06 seconds
3_DecisionTree logloss 0.625735 trained in 1.07 seconds
4_Linear logloss 0.403434 trained in 2.33 seconds
* Step default_algorithms will tr

AutoML(mode='Compete', total_time_limit=60)

In [32]:
# Evaluation
y_pred = automl.predict(X_train)
evaluate_model("mljar", y_train, y_pred)

Accuracy : 0.8418
Precision: 0.8551
Recall   : 0.7076
F1-score : 0.7744


In [33]:
# Submission
create_submission(automl.predict, X_test, submission, "submission-mljar.csv")

In [34]:
# Results
pd.DataFrame(experiment).T.style.highlight_max(axis=0)

Unnamed: 0,Acc,Pre,Rec,F1
lazypredict,0.98,0.99,0.96,0.98
tpot,0.88,0.89,0.8,0.84
mljar,0.84,0.86,0.71,0.77


### Discussion

Notes
* **Good** - It is simple to install and use
* **Good** - It presents nice results
* **Simple** - Many of the models are Tree-based
* **Simple** - It runs the scikit-learn models

Features
* **Good** - It includes preprocessing, model training and fine-tuning
* **Good** - It contains explainability features

Public Score - 0.78229 (good)
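The training log above reports `logloss` for each model, the metric MLJAR used for this binary task. As a worked example, binary log loss is the average of `-[y*log(p) + (1-y)*log(1-p)]` over the samples, where `p` is the predicted probability of class 1. The labels and probabilities below are hypothetical.

```python
import math

def binary_log_loss(y_true, p_pred):
    """Average negative log-likelihood of the true labels."""
    losses = [
        -(y * math.log(p) + (1 - y) * math.log(1 - p))
        for y, p in zip(y_true, p_pred)
    ]
    return sum(losses) / len(losses)

# Hypothetical labels and predicted probabilities of class 1
toy_labels = [1, 0, 1]
toy_probs = [0.9, 0.2, 0.6]

loss = binary_log_loss(toy_labels, toy_probs)
print(f"log loss: {loss:.4f}")
```

Unlike accuracy or F1, log loss penalizes confident wrong probabilities heavily, so the model ranked best by MLJAR is not necessarily the one with the highest F1 in the results table.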

<a id="automl-flaml"></a>

---
# FLAML

- [GitHub](https://github.com/microsoft/FLAML)
- [Documentation](https://microsoft.github.io/FLAML/)

**FLAML** is a library that finds accurate machine learning models automatically. It frees users from selecting learners and hyperparameters for each learner. It can also be used to tune generic hyperparameters for MLOps workflows, pipelines, mathematical/statistical models, algorithms, computing experiments, software configurations and so on. It offers many of the same features as **MLJAR**, but without explainability.

[> Back to Table of Contents](#ToC)

In [35]:
%%capture
# install FLAML
!pip install flaml[notebook]==1.0.12

In [36]:
%%time
from flaml import AutoML

automl = AutoML()
automl.fit(X_train, y_train, max_iter=500,
           task="classification", metric="micro_f1")

[flaml.automl: 10-10 18:23:57] {2600} INFO - task = classification
[flaml.automl: 10-10 18:23:57] {2602} INFO - Data split method: stratified
[flaml.automl: 10-10 18:23:57] {2605} INFO - Evaluation method: holdout
[flaml.automl: 10-10 18:23:57] {2727} INFO - Minimizing error metric: 1-micro_f1
[flaml.automl: 10-10 18:23:57] {2869} INFO - List of ML learners in AutoML Run: ['lgbm', 'rf', 'catboost', 'xgboost', 'extra_tree', 'xgb_limitdepth', 'lrl1']
[flaml.automl: 10-10 18:23:57] {3174} INFO - iteration 0, current learner lgbm
[flaml.automl: 10-10 18:23:57] {3308} INFO - Estimated sufficient time budget=934s. Estimated necessary time budget=23s.
[flaml.automl: 10-10 18:23:57] {3360} INFO -  at 0.1s,	estimator lgbm's best error=0.2418,	best estimator lgbm's best error=0.2418
[flaml.automl: 10-10 18:23:57] {3174} INFO - iteration 1, current learner lgbm
[flaml.automl: 10-10 18:23:57] {3360} INFO -  at 0.2s,	estimator lgbm's best error=0.2418,	best estimator lgbm's best error=0.2418
[flaml

CPU times: user 1min 38s, sys: 5.77 s, total: 1min 44s
Wall time: 1min 7s


In [37]:
# Best model
print(automl.model.estimator)

XGBClassifier(base_score=0.5, booster='gbtree', callbacks=[],
              colsample_bylevel=1.0, colsample_bynode=1, colsample_bytree=1.0,
              early_stopping_rounds=None, enable_categorical=False,
              eval_metric=None, gamma=0, gpu_id=-1, grow_policy='lossguide',
              importance_type=None, interaction_constraints='',
              learning_rate=0.7895542070824232, max_bin=256,
              max_cat_to_onehot=4, max_delta_step=0, max_depth=0, max_leaves=15,
              min_child_weight=7.110058659500221, missing=nan,
              monotone_constraints='()', n_estimators=6, n_jobs=-1,
              num_parallel_tree=1, predictor='auto', random_state=0,
              reg_alpha=0.05626297533035684, reg_lambda=10.184858367611008, ...)


In [38]:
# Evaluation
y_pred = automl.predict(X_train)
evaluate_model("flaml", y_train, y_pred)

Accuracy : 0.8530
Precision: 0.8482
Recall   : 0.7515
F1-score : 0.7969


In [39]:
# Submission
create_submission(automl.predict, X_test, submission, "submission-flaml.csv")

In [40]:
# Results
pd.DataFrame(experiment).T.style.highlight_max(axis=0)

Unnamed: 0,Acc,Pre,Rec,F1
lazypredict,0.98,0.99,0.96,0.98
tpot,0.88,0.89,0.8,0.84
mljar,0.84,0.86,0.71,0.77
flaml,0.85,0.85,0.75,0.8


### Discussion

Notes
* **Good** - It is simple to install and use
* **Good** - It presents nice results
* **Bad** - Only contains a few models
* **Interesting** - It has a version for .NET

Features
* **Good** - It includes preprocessing, model training and fine-tuning
* **Good** - It contains text processing and online learning

Public Score - 0.78468 (best)
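One detail worth checking is the optimization metric: the FLAML call above uses `metric="micro_f1"`, while `evaluate_model` ranks models by the binary F1-score. For single-label classification, micro-averaged F1 is equal to accuracy, so the two can differ noticeably. A small sketch with hypothetical toy labels and scikit-learn's metric functions:

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical labels and predictions
toy_true = [1, 1, 1, 0, 0, 0, 1, 0]
toy_pred = [1, 0, 1, 0, 1, 0, 1, 1]

micro_f1 = f1_score(toy_true, toy_pred, average="micro")  # pooled over both classes
binary_f1 = f1_score(toy_true, toy_pred)                  # positive class only
accuracy = accuracy_score(toy_true, toy_pred)

print(f"micro F1 : {micro_f1:.4f}")
print(f"accuracy : {accuracy:.4f}")
print(f"binary F1: {binary_f1:.4f}")
```

If the goal is to compare tools on binary F1, it may be worth checking whether each library can optimize that metric directly.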

<a id="automl-autogluon"></a>

---
# AutoGluon

- [GitHub](https://github.com/awslabs/autogluon)
- [Documentation](https://auto.gluon.ai/)

**AutoGluon** automates machine learning tasks enabling you to easily achieve strong predictive performance in your applications. With just a few lines of code, you can train and deploy high-accuracy machine learning and deep learning models on image, text, time series, and tabular data. _Of the tools covered so far, it is the most robust and generic_.

[> Back to Table of Contents](#ToC)

In [41]:
%%capture
# install AutoGluon
!pip install autogluon==0.5.2

In [42]:
%%time
from autogluon.tabular import TabularPredictor

automl = TabularPredictor(label=label_column, path="titanic-autogluon",
                           eval_metric="f1_micro", verbosity=1)
automl.fit(train.drop(columns=['PassengerId']))

AutoGluon infers your prediction problem is: 'binary' (because only two unique label-values observed).
	If 'binary' is not the correct problem_type, please manually specify the problem_type parameter during predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
AutoGluon will gauge predictive performance using evaluation metric: 'f1_micro'
Intel(R) Extension for Scikit-learn* enabled (https://github.com/intel/scikit-learn-intelex)
		2 different `eval_metric` are provided.  Use the one in constructor or `set_params` instead.
Detailed Traceback:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/autogluon/core/trainer/abstract_trainer.py", lin

CPU times: user 32.3 s, sys: 1.86 s, total: 34.1 s
Wall time: 24.5 s


<autogluon.tabular.predictor.predictor.TabularPredictor at 0x7fc69959c490>

In [43]:
# Generates a complete report about the models
_ = automl.fit_summary()

*** Summary of fit() ***
Estimated performance of each model:
                  model  score_val  pred_time_val  fit_time  pred_time_val_marginal  fit_time_marginal  stack_level  can_infer  fit_order
0        NeuralNetTorch       0.85           0.02      4.24                    0.02               4.24            1       True         11
1   WeightedEnsemble_L2       0.85           0.03      5.59                    0.00               1.34            2       True         13
2       NeuralNetFastAI       0.84           0.04      1.93                    0.04               1.93            1       True         10
3         LightGBMLarge       0.83           0.01      2.64                    0.01               2.64            1       True         12
4              LightGBM       0.82           0.01      2.05                    0.01               2.05            1       True          4
5      RandomForestGini       0.82           0.21      1.17                    0.21               1.17        

In [44]:
# Evaluation
y_pred = automl.predict(train.drop(columns=['PassengerId']))
evaluate_model("auto-gluon", y_train, y_pred)

Accuracy : 0.9012
Precision: 0.8969
Recall   : 0.8392
F1-score : 0.8671


In [45]:
# Submission
create_submission(automl.predict, test.drop(columns=['PassengerId']), submission, "submission-auto-gluon.csv")

In [46]:
# Results
pd.DataFrame(experiment).T.style.highlight_max(axis=0)

Unnamed: 0,Acc,Pre,Rec,F1
lazypredict,0.98,0.99,0.96,0.98
tpot,0.88,0.89,0.8,0.84
mljar,0.84,0.86,0.71,0.77
flaml,0.85,0.85,0.75,0.8
auto-gluon,0.9,0.9,0.84,0.87


### Discussion

Notes
* **Good** - It is simple to install and use
* **Good** - It presents amazing results

Features
* **Good** - It includes preprocessing, model training and fine-tuning
* **Good** - It contains text, image and multimodal processing techniques

Public Score - 0.76076

<a id="automl-h2o"></a>

---
# H2O

- [GitHub](https://github.com/h2oai/h2o-3)
- [Documentation](https://h2o.ai/)

**H2O** is an in-memory platform for distributed, scalable machine learning. It provides its own implementations of many popular algorithms rather than relying on the scikit-learn ones.

[> Back to Table of Contents](#ToC)

In [47]:
%%capture
# install H2O
!pip install h2o==3.36.1.4

In [48]:
%%time
import h2o
from h2o.automl import H2OAutoML

# Start the H2O cluster (locally)
h2o.init()
# Get H2O data frame
hf_train = h2o.H2OFrame(train)
hf_test  = h2o.H2OFrame(test)
hf_train[label_column] = hf_train[label_column].asfactor()

Checking whether there is an H2O instance running at http://localhost:54321 ..... not found.
Attempting to start a local H2O server...
  Java Version: openjdk version "11.0.16" 2022-07-19; OpenJDK Runtime Environment (build 11.0.16+8-post-Ubuntu-0ubuntu120.04); OpenJDK 64-Bit Server VM (build 11.0.16+8-post-Ubuntu-0ubuntu120.04, mixed mode, sharing)
  Starting server from /opt/conda/lib/python3.7/site-packages/h2o/backend/bin/h2o.jar
  Ice root: /tmp/tmpdwyktum7
  JVM stdout: /tmp/tmpdwyktum7/h2o_unknownUser_started_from_python.out
  JVM stderr: /tmp/tmpdwyktum7/h2o_unknownUser_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321 ... successful.


0,1
H2O_cluster_uptime:,03 secs
H2O_cluster_timezone:,Etc/UTC
H2O_data_parsing_timezone:,UTC
H2O_cluster_version:,3.36.1.4
H2O_cluster_version_age:,2 months and 7 days
H2O_cluster_name:,H2O_from_python_unknownUser_oq2gwv
H2O_cluster_total_nodes:,1
H2O_cluster_free_memory:,4 Gb
H2O_cluster_total_cores:,4
H2O_cluster_allowed_cores:,4


Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
CPU times: user 258 ms, sys: 88.8 ms, total: 347 ms
Wall time: 10.8 s


In [49]:
# AutoML for 10 base models
automl = H2OAutoML(max_models=10, seed=seed_number)
automl.train(x=x_columns, y=label_column, training_frame=hf_train)

AutoML progress: |███████████████████████████████████████████████████████████████| (done) 100%
Model Details
H2OGradientBoostingEstimator :  Gradient Boosting Machine
Model Key:  GBM_3_AutoML_1_20221010_182717


Model Summary: 


Unnamed: 0,Unnamed: 1,number_of_trees,number_of_internal_trees,model_size_in_bytes,min_depth,max_depth,mean_depth,min_leaves,max_leaves,mean_leaves
0,,40.0,40.0,21023.0,7.0,8.0,7.97,23.0,52.0,37.12




ModelMetricsBinomial: gbm
** Reported on train data. **

MSE: 0.07409565800317175
RMSE: 0.2722051762975343
LogLoss: 0.262592101090439
Mean Per-Class Error: 0.09452593231713163
AUC: 0.9640840869630055
AUCPR: 0.9540204588442834
Gini: 0.928168173926011

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.3457409995979469: 


Unnamed: 0,Unnamed: 1,0,1,Error,Rate
0,0,503.0,46.0,0.0838,(46.0/549.0)
1,1,36.0,306.0,0.1053,(36.0/342.0)
2,Total,539.0,352.0,0.092,(82.0/891.0)



Maximum Metrics: Maximum metrics at their respective thresholds


Unnamed: 0,metric,threshold,value,idx
0,max f1,0.35,0.88,207.0
1,max f2,0.29,0.9,224.0
2,max f0point5,0.56,0.91,159.0
3,max accuracy,0.5,0.91,171.0
4,max precision,0.99,1.0,0.0
5,max recall,0.07,1.0,356.0
6,max specificity,0.99,1.0,0.0
7,max absolute_mcc,0.5,0.81,171.0
8,max min_per_class_accuracy,0.33,0.9,214.0
9,max mean_per_class_accuracy,0.34,0.91,210.0



Gains/Lift Table: Avg response rate: 38.38 %, avg score: 38.19 %


Unnamed: 0,group,cumulative_data_fraction,lower_threshold,lift,cumulative_lift,response_rate,score,cumulative_response_rate,cumulative_score,capture_rate,cumulative_capture_rate,gain,cumulative_gain,kolmogorov_smirnov
0,1,0.01,0.98,2.61,2.61,1.0,0.98,1.0,0.98,0.03,0.03,160.53,160.53,0.03
1,2,0.02,0.98,2.61,2.61,1.0,0.98,1.0,0.98,0.03,0.05,160.53,160.53,0.05
2,3,0.03,0.98,2.61,2.61,1.0,0.98,1.0,0.98,0.03,0.08,160.53,160.53,0.08
3,4,0.04,0.97,2.61,2.61,1.0,0.97,1.0,0.98,0.03,0.11,160.53,160.53,0.11
4,5,0.05,0.97,2.61,2.61,1.0,0.97,1.0,0.98,0.03,0.13,160.53,160.53,0.13
5,6,0.1,0.94,2.61,2.61,1.0,0.95,1.0,0.97,0.13,0.26,160.53,160.53,0.26
6,7,0.15,0.9,2.61,2.61,1.0,0.92,1.0,0.95,0.13,0.39,160.53,160.53,0.39
7,8,0.2,0.84,2.61,2.61,1.0,0.87,1.0,0.93,0.13,0.52,160.53,160.53,0.52
8,9,0.3,0.63,2.2,2.47,0.84,0.75,0.95,0.87,0.22,0.74,119.54,146.92,0.72
9,10,0.4,0.34,1.58,2.25,0.61,0.48,0.86,0.77,0.16,0.9,58.07,124.77,0.81




ModelMetricsBinomial: gbm
** Reported on cross-validation data. **

MSE: 0.12242024427174984
RMSE: 0.3498860446941973
LogLoss: 0.40192205298777334
Mean Per-Class Error: 0.17066915923689002
AUC: 0.8767882060950798
AUCPR: 0.8628759250766063
Gini: 0.7535764121901596

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.42156451330301314: 


Unnamed: 0,Unnamed: 1,0,1,Error,Rate
0,0,482.0,67.0,0.122,(67.0/549.0)
1,1,75.0,267.0,0.2193,(75.0/342.0)
2,Total,557.0,334.0,0.1594,(142.0/891.0)



Maximum Metrics: Maximum metrics at their respective thresholds


Unnamed: 0,metric,threshold,value,idx
0,max f1,0.42,0.79,193.0
1,max f2,0.23,0.81,259.0
2,max f0point5,0.68,0.83,120.0
3,max accuracy,0.51,0.85,165.0
4,max precision,0.99,1.0,0.0
5,max recall,0.02,1.0,397.0
6,max specificity,0.99,1.0,0.0
7,max absolute_mcc,0.5,0.67,168.0
8,max min_per_class_accuracy,0.33,0.81,220.0
9,max mean_per_class_accuracy,0.42,0.83,193.0



Gains/Lift Table: Avg response rate: 38.38 %, avg score: 38.00 %


Unnamed: 0,group,cumulative_data_fraction,lower_threshold,lift,cumulative_lift,response_rate,score,cumulative_response_rate,cumulative_score,capture_rate,cumulative_capture_rate,gain,cumulative_gain,kolmogorov_smirnov
0,1,0.01,0.98,2.61,2.61,1.0,0.98,1.0,0.98,0.03,0.03,160.53,160.53,0.03
1,2,0.02,0.97,2.61,2.61,1.0,0.98,1.0,0.98,0.03,0.05,160.53,160.53,0.05
2,3,0.03,0.96,2.61,2.61,1.0,0.97,1.0,0.98,0.03,0.08,160.53,160.53,0.08
3,4,0.04,0.95,2.61,2.61,1.0,0.96,1.0,0.97,0.03,0.11,160.53,160.53,0.11
4,5,0.05,0.95,2.61,2.61,1.0,0.95,1.0,0.97,0.03,0.13,160.53,160.53,0.13
5,6,0.1,0.91,2.61,2.61,1.0,0.93,1.0,0.95,0.13,0.26,160.53,160.53,0.26
6,7,0.15,0.86,2.19,2.47,0.84,0.89,0.95,0.93,0.11,0.37,119.08,146.92,0.36
7,8,0.2,0.79,2.32,2.43,0.89,0.83,0.93,0.91,0.12,0.49,131.58,143.06,0.47
8,9,0.3,0.58,1.93,2.27,0.74,0.7,0.87,0.84,0.19,0.68,93.2,126.5,0.62
9,10,0.4,0.37,1.17,1.99,0.45,0.47,0.76,0.75,0.12,0.8,17.09,99.23,0.65




Cross-Validation Metrics Summary: 


Unnamed: 0,Unnamed: 1,mean,sd,cv_1_valid,cv_2_valid,cv_3_valid,cv_4_valid,cv_5_valid
0,accuracy,0.85,0.05,0.87,0.86,0.77,0.9,0.85
1,auc,0.87,0.03,0.89,0.88,0.82,0.9,0.87
2,err,0.15,0.05,0.13,0.14,0.23,0.1,0.15
3,err_count,26.6,8.62,23.0,25.0,41.0,18.0,26.0
4,f0point5,0.81,0.07,0.84,0.82,0.69,0.88,0.82
5,f1,0.81,0.05,0.83,0.83,0.72,0.85,0.81
6,f2,0.81,0.03,0.82,0.84,0.75,0.83,0.8
7,lift_top_group,2.61,0.12,2.63,2.44,2.62,2.78,2.58
8,logloss,0.41,0.06,0.38,0.38,0.5,0.35,0.42
9,max_per_class_error,0.19,0.03,0.19,0.15,0.24,0.19,0.2



Scoring History: 


Unnamed: 0,Unnamed: 1,timestamp,duration,number_of_trees,training_rmse,training_logloss,training_auc,training_pr_auc,training_lift,training_classification_error
0,,2022-10-10 18:27:40,1.226 sec,0.0,0.49,0.67,0.5,0.38,1.0,0.62
1,,2022-10-10 18:27:40,1.253 sec,5.0,0.4,0.51,0.92,0.91,2.61,0.14
2,,2022-10-10 18:27:40,1.285 sec,10.0,0.35,0.42,0.94,0.92,2.61,0.12
3,,2022-10-10 18:27:40,1.320 sec,15.0,0.33,0.37,0.94,0.93,2.61,0.11
4,,2022-10-10 18:27:40,1.353 sec,20.0,0.31,0.33,0.95,0.94,2.61,0.1
5,,2022-10-10 18:27:40,1.384 sec,25.0,0.3,0.31,0.96,0.94,2.61,0.1
6,,2022-10-10 18:27:40,1.424 sec,30.0,0.28,0.29,0.96,0.95,2.61,0.1
7,,2022-10-10 18:27:40,1.451 sec,35.0,0.28,0.27,0.96,0.95,2.61,0.09
8,,2022-10-10 18:27:40,1.480 sec,40.0,0.27,0.26,0.96,0.95,2.61,0.09



Variable Importances: 


Unnamed: 0,variable,relative_importance,scaled_importance,percentage
0,Sex,240.05,1.0,0.33
1,Age,175.09,0.73,0.24
2,Fare,130.63,0.54,0.18
3,Pclass,114.83,0.48,0.16
4,SibSp,30.2,0.13,0.04
5,Embarked,22.46,0.09,0.03
6,Parch,16.64,0.07,0.02




In [50]:
# Sort the evaluated models
automl.leaderboard

model_id,auc,logloss,aucpr,mean_per_class_error,rmse,mse
GBM_3_AutoML_1_20221010_182717,0.876788,0.401922,0.862876,0.170669,0.349886,0.12242
StackedEnsemble_BestOfFamily_1_AutoML_1_20221010_182717,0.874767,0.406227,0.858213,0.17913,0.352354,0.124153
StackedEnsemble_AllModels_1_AutoML_1_20221010_182717,0.872613,0.408971,0.858323,0.176325,0.35257,0.124305
GBM_2_AutoML_1_20221010_182717,0.871582,0.409687,0.857846,0.176277,0.354591,0.125735
XGBoost_3_AutoML_1_20221010_182717,0.87131,0.417834,0.847479,0.180208,0.355459,0.126351
GBM_4_AutoML_1_20221010_182717,0.870219,0.415496,0.852784,0.176349,0.356529,0.127113
XGBoost_2_AutoML_1_20221010_182717,0.868117,0.424171,0.841044,0.191688,0.361474,0.130663
DRF_1_AutoML_1_20221010_182717,0.861396,0.706447,0.831632,0.193917,0.370478,0.137254
XGBoost_1_AutoML_1_20221010_182717,0.860491,0.433248,0.834473,0.202641,0.368405,0.135722
XRT_1_AutoML_1_20221010_182717,0.858624,0.456702,0.813863,0.192671,0.379698,0.144171




In [51]:
# Evaluation
y_pred = automl.predict(hf_train).as_data_frame()
y_pred = y_pred['predict'].tolist()
evaluate_model("h2o", y_train, y_pred)

gbm prediction progress: |███████████████████████████████████████████████████████| (done) 100%
Accuracy : 0.9068
Precision: 0.8689
Recall   : 0.8918
F1-score : 0.8802


In [52]:
# Submission
def h2o_predict(hf_dataset):
    y_pred = automl.predict(hf_dataset).as_data_frame()
    y_pred = y_pred['predict'].tolist()
    return y_pred

create_submission(h2o_predict, hf_test, submission, "submission-h2o.csv")

gbm prediction progress: |███████████████████████████████████████████████████████| (done) 100%


In [53]:
# Results
pd.DataFrame(experiment).T.style.highlight_max(axis=0)

model,Acc,Pre,Rec,F1
lazypredict,0.98,0.99,0.96,0.98
tpot,0.88,0.89,0.8,0.84
mljar,0.84,0.86,0.71,0.77
flaml,0.85,0.85,0.75,0.8
auto-gluon,0.9,0.9,0.84,0.87
h2o,0.91,0.87,0.89,0.88


### Discussion

Notes
* **Bad** - It is not simple to install or use
* **Bad** - It does not follow the scikit-learn API
* **Bad** - It does not use `pandas.DataFrame`; it uses its own `H2OFrame`
* **Good** - It presents strong results on the training set (accuracy 0.91, F1 0.88)

Features
* **Good** - It includes preprocessing, model training and fine-tuning

Public Score - 0.73444 (worse)
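
The confusion matrices in the H2O training output are enough to recompute the usual metrics by hand, which makes the gap between training and cross-validation explicit. Below is a minimal sketch that uses only the counts printed in that output; `binary_metrics` is a hypothetical helper, not part of H2O or of this notebook. Both matrices are reported at their max-F1 threshold, so the numbers differ slightly from the `evaluate_model` output.

```python
def binary_metrics(tn, fp, fn, tp):
    """Return accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tn + fp + fn + tp
    accuracy = (tn + tp) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Counts copied from the H2O output (rows = actual class, columns = predicted class).
train_metrics = binary_metrics(tn=503, fp=46, fn=36, tp=306)
cv_metrics = binary_metrics(tn=482, fp=67, fn=75, tp=267)

for name, metrics in (("train", train_metrics), ("cross-validation", cv_metrics)):
    print(name, {key: round(value, 4) for key, value in metrics.items()})
```

The cross-validated F1 comes out near 0.79, against roughly 0.88 on the training data. The cross-validation figures are therefore a more realistic hint of the public score than the training metrics.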

<a id="automl-autokeras"></a>

---
# AutoKeras

- [GitHub](https://github.com/keras-team/autokeras)
- [Documentation](https://autokeras.com/)

AutoKeras is an AutoML system based on Keras, a Deep Learning library. AutoKeras can perform AutoML on different types of data, such as tabular, image or even text. Deep Learning models need tabular values as `int` or `float`; however, they can extract intrinsic features from the data, so they tend to do well on real data.
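
To make the `int`/`float` requirement concrete, here is a small pure-Python sketch of label encoding for categorical columns such as `Sex` and `Embarked`. The helper names (`fit_label_encoder`, `encode`) and the `-1` code for missing values are assumptions chosen for illustration; the notebook itself prepares the data in the Data Exploration section.

```python
def fit_label_encoder(values):
    """Map each distinct non-missing value to an integer, in sorted order."""
    classes = sorted({value for value in values if value is not None})
    return {label: index for index, label in enumerate(classes)}


def encode(values, mapping, missing=-1):
    """Replace each value by its integer code; unknown or missing values get `missing`."""
    return [mapping.get(value, missing) for value in values]


# Toy columns in the style of the Titanic data (None marks a missing value).
sex = ["male", "female", "female", "male"]
embarked = ["S", "C", None, "Q"]

sex_codes = encode(sex, fit_label_encoder(sex))
embarked_codes = encode(embarked, fit_label_encoder(embarked))

print(sex_codes)
print(embarked_codes)
```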

[> Back to Table of Contents](#ToC)

In [54]:
%%capture
# Install AutoKeras
!pip install autokeras==1.0.20

In [55]:
%%time
import tensorflow as tf
import autokeras as ak

automl = ak.StructuredDataClassifier(overwrite=True, max_trials=10)
automl.fit(X_train, y_train, epochs=15)

Trial 10 Complete [00h 00m 08s]
val_accuracy: 0.6387096643447876

Best val_accuracy So Far: 0.8774193525314331
Total elapsed time: 00h 01m 19s
Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15
CPU times: user 1min 29s, sys: 4.06 s, total: 1min 33s
Wall time: 1min 36s


<keras.callbacks.History at 0x7fc54c404fd0>

In [56]:
# Evaluation
y_pred = automl.predict(X_train)
evaluate_model("auto-keras", y_train, y_pred)

Accuracy : 0.8384
Precision: 0.8750
Recall   : 0.6754
F1-score : 0.7624


In [57]:
# Submission
create_submission(automl.predict, X_test, submission, "submission-auto-keras.csv")



In [58]:
# Results
pd.DataFrame(experiment).T.style.highlight_max(axis=0)

model,Acc,Pre,Rec,F1
lazypredict,0.98,0.99,0.96,0.98
tpot,0.88,0.89,0.8,0.84
mljar,0.84,0.86,0.71,0.77
flaml,0.85,0.85,0.75,0.8
auto-gluon,0.9,0.9,0.84,0.87
h2o,0.91,0.87,0.89,0.88
auto-keras,0.84,0.88,0.68,0.76


### Discussion

Notes
* **Good** - It is simple to install and use
* **Good** - It presents good results
* **Good** - It produces Deep Learning models

Features
* **Good** - It includes preprocessing and model training
* **Good** - It can process tabular, image and text data

Public Score - 0.77033 (good)

<a id="automl-mlbox"></a>

---
# MLBox

- [Example](https://www.kaggle.com/code/axelderomblay/running-mlbox-auto-ml-package-on-titanic/notebook)
- [GitHub](https://github.com/AxeldeRomblay/MLBox)
- [Documentation](https://mlbox.readthedocs.io/en/latest/)

MLBox is an AutoML library that provides data preprocessing, feature selection, hyperparameter optimization and model evaluation. It contains several state-of-the-art competition models, such as Deep Learning, Stacking and LightGBM. It also offers a few interpretation functions to analyze the results.

[> Back to Table of Contents](#ToC)

_This code is shown as Markdown instead of being executed, because it requires packages that are incompatible with this Jupyter Notebook._

**Installation**

```python
%%capture
# Downgrade main packages
!pip install numpy==1.18.2 pandas==0.25.3
!pip install scikit-learn==0.22.1 tensorflow==2.0.0 
# Install MLBox
!pip install mlbox==0.8.5
```

**Init**

```python
# Init MLBox
import pandas as pd  # needed later to build the submission file

from mlbox.preprocessing import *
from mlbox.optimisation import *
from mlbox.prediction import *
```

**Read data**

```python
# Data infos
target_name = "Survived"
paths = ["../input/titanic/train.csv","../input/titanic/test.csv"]

# Reading and Preprocessing Data
rd = Reader(sep = ",")
df = rd.train_test_split(paths, target_name)
```

**Preprocessing**

```python
# Automatically drop features whose distribution drifts between train and test
dft = Drift_thresholder()
df = dft.fit_transform(df)
```

**AutoML**

```python
space = {
    'est__strategy':{"search":"choice", "space":["LightGBM"]},    
    'est__n_estimators':{"search":"choice", "space":[150]},    
    'est__colsample_bytree':{"search":"uniform", "space":[0.8,0.95]},
    'est__subsample':{"search":"uniform", "space":[0.8,0.95]},
    'est__max_depth':{"search":"choice", "space":[5,6,7,8,9]},
    'est__learning_rate':{"search":"choice", "space":[0.07]} 
}

# AutoML
opt = Optimiser(scoring = "accuracy", n_folds = 5)
params = opt.optimise(space, df, 15)
```
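
To read the search space above: a `"choice"` entry picks one value from a list, while a `"uniform"` entry draws a float between the two bounds. The sketch below only illustrates what one candidate configuration looks like by sampling at random with the standard library. It is not MLBox's search algorithm, and `sample_candidate` is a hypothetical helper.

```python
import random

space = {
    "est__strategy": {"search": "choice", "space": ["LightGBM"]},
    "est__n_estimators": {"search": "choice", "space": [150]},
    "est__colsample_bytree": {"search": "uniform", "space": [0.8, 0.95]},
    "est__subsample": {"search": "uniform", "space": [0.8, 0.95]},
    "est__max_depth": {"search": "choice", "space": [5, 6, 7, 8, 9]},
    "est__learning_rate": {"search": "choice", "space": [0.07]},
}


def sample_candidate(space, rng):
    """Draw one configuration from a space written in the format used above."""
    candidate = {}
    for name, spec in space.items():
        if spec["search"] == "choice":
            candidate[name] = rng.choice(spec["space"])
        elif spec["search"] == "uniform":
            low, high = spec["space"]
            candidate[name] = rng.uniform(low, high)
        else:
            raise ValueError(f"unsupported search type: {spec['search']}")
    return candidate


rng = random.Random(0)  # fixed seed so the draws are reproducible
candidates = [sample_candidate(space, rng) for _ in range(15)]
for candidate in candidates[:3]:
    print(candidate)
```

Only `est__colsample_bytree`, `est__subsample` and `est__max_depth` actually vary here; the other entries are single-value choices, so they are effectively fixed.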

**Submission**

```python
# Making Predictions
prd = Predictor()
prd.fit_predict(params, df)

# Reading the Results
submit = pd.read_csv("../input/titanic/gender_submission.csv", sep=',')
preds = pd.read_csv("save/"+target_name+"_predictions.csv")

# Generating Submission
submit[target_name] = preds[target_name+"_predicted"].values
submit.to_csv("submission-mlbox.csv", index=False)
```

### Discussion

Notes
* **Good** - It is simple to install
* **Bad** - It is complex to use; far from the scikit-learn API
* **Bad** - It did not work in the Kaggle environment (2022-08-25), but it worked in Google Colab

Features
* **Good** - It includes preprocessing and model training

Public Score - 0.78708 (good)

<a id="automl-pycaret"></a>

---
# PyCaret

- [GitHub](https://github.com/pycaret/pycaret)
- [Documentation](https://pycaret.org/)

**PyCaret** is a low-code machine learning library that automates machine learning workflows. It is an end-to-end machine learning and model management tool that speeds up the experiment cycle exponentially and makes you more productive; AutoML is just one of the features of this package.

[> Back to Table of Contents](#ToC)

_This code is shown as Markdown instead of being executed, because it requires packages that are incompatible with this Jupyter Notebook._

**Installation**

```python
%%capture
# install pycaret
!pip install llvmlite --ignore-installed
!pip install pycaret==2.3.10
!pip install numpy==1.20.3
```

**Read data**

```python
# Read data
import numpy as np
import pandas as pd

train = pd.read_csv("/kaggle/input/titanic/train.csv")
test  = pd.read_csv("/kaggle/input/titanic/test.csv")
```

**AutoML**

```python
%%time
from pycaret import classification

s = classification.setup(train, target = 'Survived')
```

**Evaluation**

```python
# Evaluation
best = classification.compare_models()
classification.plot_model(best)
```

**Submission**

```python
# Submission
submission = classification.predict_model(best, data=test)
submission = submission[['PassengerId', 'Label']].rename(columns={'Label': 'Survived'})
submission.to_csv("submission-pycaret.csv", index=False)
```
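
Because this submission file is built outside the notebook's `create_submission` helper, a quick format check before uploading can save a wasted submission. The sketch below uses only the standard library and assumes the usual Titanic layout: a `PassengerId,Survived` header, 418 test rows and integer `0`/`1` predictions, as in the competition's sample submission. `check_submission` is a hypothetical helper.

```python
import csv
import io

EXPECTED_HEADER = ["PassengerId", "Survived"]


def check_submission(lines, expected_rows=418):
    """Return a list of problems found in a submission; an empty list means it looks fine."""
    reader = csv.DictReader(lines)
    if reader.fieldnames != EXPECTED_HEADER:
        return [f"unexpected header: {reader.fieldnames}"]
    rows = list(reader)
    problems = []
    if len(rows) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(rows)}")
    bad_ids = [row["PassengerId"] for row in rows if row["Survived"] not in ("0", "1")]
    if bad_ids:
        problems.append(f"non 0/1 predictions for PassengerId: {bad_ids}")
    return problems


# Tiny in-memory example: the last row holds a float instead of an integer label.
sample = io.StringIO("PassengerId,Survived\n892,0\n893,1\n894,0.0\n")
problems = check_submission(sample, expected_rows=3)
print(problems)
```

For a real file, pass an open handle instead of the in-memory sample, for example `check_submission(open("submission-pycaret.csv", newline=""))`.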


### Discussion

Notes
* **Good** - It is simple to install and use
* **Good** - It is a complete tool, with different functionalities for ML-based production
* **Bad** - It did not work in the Kaggle environment (2022-08-25), but it worked in Google Colab

Features
* **Good** - It includes preprocessing and model training, including complex models
* **Good** - It also includes clustering and anomaly detection

Public Score - 0.78229 (good)
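
As a quick way to compare the last four packages, the snippet below ranks the public scores quoted in their Discussion sections. Where a training accuracy was reported (H2O and AutoKeras), it also prints the gap between training accuracy and public score. This is a plain-Python sketch and the variable names are illustrative; MLBox and PyCaret were run outside this notebook, so no training accuracy is available for them.

```python
# Public leaderboard scores quoted in the Discussion sections above.
public_score = {
    "h2o": 0.73444,
    "auto-keras": 0.77033,
    "mlbox": 0.78708,
    "pycaret": 0.78229,
}

# Training-set accuracy printed by evaluate_model, where available.
train_accuracy = {"h2o": 0.9068, "auto-keras": 0.8384}

ranking = sorted(public_score, key=public_score.get, reverse=True)
gaps = {name: train_accuracy[name] - public_score[name] for name in train_accuracy}

for name in ranking:
    gap = f"{gaps[name]:.4f}" if name in gaps else "n/a"
    print(f"{name:<12} public={public_score[name]:.5f}  train-public gap={gap}")
```

The large gap for H2O suggests that its training-set metrics overstate how well the model generalizes to the test set.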