# 🚀🚀🚀 Fast AutoML with AutoGluon and Intel® Extension for Scikit-learn* - Kaggle Tabular Playground Series - October 2021

AutoML greatly simplifies building high-quality models, but it can be too slow, especially on large problems. In this notebook, we show how to accelerate the AutoML framework [AutoGluon](https://github.com/awslabs/autogluon) using [**Intel® Extension for Scikit-learn***](https://github.com/intel/scikit-learn-intelex), which speeds up Scikit-learn's algorithms seamlessly with one pip package installation and two lines of code.

This notebook solves a binary classification task, but you can use it as a template for many other competitions with a few changes, depending on the task type (multiclass or regression) and your needs.

I will show you how to **speed up** your kernel and get predictions with **better quality** using **Intel® Extension for Scikit-learn**.

### AutoGluon installation:

In [None]:
!pip install autogluon.tabular[all] -q --progress-bar off

### Intel® Extension for Scikit-learn installation:

In [None]:
!pip install scikit-learn-intelex -q --progress-bar off

### Reading data

In [None]:
import pandas as pd

id_column = 'id'
train_data = pd.read_csv("/kaggle/input/tabular-playground-series-oct-2021/train.csv", index_col=id_column)
test_data = pd.read_csv("/kaggle/input/tabular-playground-series-oct-2021/test.csv", index_col=id_column)
submission = pd.read_csv("/kaggle/input/tabular-playground-series-oct-2021/sample_submission.csv", index_col=id_column)

In [None]:
train_data[:5]

In [None]:
train_data.info()

### Reduce DataFrame memory usage

Because the data is large relative to the RAM of a Kaggle notebook instance, we reduce memory usage by switching to smaller data types.

In [None]:
label = 'target'
features = [col for col in train_data.columns if 'f' in col]

cont_features = []
disc_features = []

for col in features:
    if train_data[col].dtype=='float64':
        cont_features.append(col)
    else:
        disc_features.append(col)

train_data[cont_features] = train_data[cont_features].astype('float32')
train_data[disc_features] = train_data[disc_features].astype('uint8')
# apply the same downcast to the test set
test_data[cont_features] = test_data[cont_features].astype('float32')
test_data[disc_features] = test_data[disc_features].astype('uint8')

In [None]:
train_data.info()

Memory usage was reduced from 2.1 GB to 974 MB
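
The effect of such a downcast can be checked on a small synthetic frame. The sketch below is illustrative only: the column names `f0` and `f1` and the random data are made up, and the byte counts are not those of the competition data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_rows = 10_000

# Hypothetical stand-in for the competition data: one continuous and one binary column.
df = pd.DataFrame({
    "f0": rng.random(n_rows),               # float64 by default
    "f1": rng.integers(0, 2, size=n_rows),  # int64 by default
})

before = df.memory_usage(index=False).sum()

# Downcast only after checking that the values fit the narrower type.
assert df["f1"].between(0, 255).all()
df["f0"] = df["f0"].astype("float32")
df["f1"] = df["f1"].astype("uint8")

after = df.memory_usage(index=False).sum()
print(f"{before} bytes -> {after} bytes")
```

`float32` keeps roughly 7 significant decimal digits, which is usually enough for tree-based models; `uint8` only holds integers from 0 to 255, hence the range check.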

# AutoGluon with Intel® Extension for Scikit-learn

Run just two lines of code to accelerate Scikit-learn:

In [None]:
from sklearnex import patch_sklearn
patch_sklearn()
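
The patch swaps in accelerated estimator classes, so it should be called before importing anything from `sklearn`. Below is a minimal sketch of a guarded variant that falls back to stock Scikit-learn when the extension is not installed; the helper name `enable_acceleration` is hypothetical.

```python
import importlib.util


def enable_acceleration() -> bool:
    """Patch Scikit-learn if sklearnex is installed; return whether the patch was applied."""
    if importlib.util.find_spec("sklearnex") is None:
        return False
    from sklearnex import patch_sklearn
    patch_sklearn()
    return True


# Call this first, and import estimators only afterwards.
accelerated = enable_acceleration()
print("sklearnex patch applied:", accelerated)

if importlib.util.find_spec("sklearn") is not None:
    from sklearn.ensemble import RandomForestClassifier
    # The defining module hints at which implementation is in use.
    print(RandomForestClassifier.__module__)
```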

Enable logging to a file to track usage of sklearnex. The handler is set to level 10 (`logging.DEBUG`), which also lets INFO messages through:

In [None]:
import logging

logger = logging.getLogger()
fh = logging.FileHandler('log.txt')
fh.setLevel(10)
logger.addHandler(fh)
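
A record reaches the file only if it passes both the level of the logger that emits it and the level of the handler. The following self-contained sketch illustrates this with a hypothetical logger name `demo_accel` and a temporary file, not with sklearnex itself.

```python
import logging
import os
import tempfile

log_path = os.path.join(tempfile.mkdtemp(), "log.txt")

root = logging.getLogger()
handler = logging.FileHandler(log_path)
handler.setLevel(logging.DEBUG)         # same as the numeric level 10
root.addHandler(handler)

demo = logging.getLogger("demo_accel")  # hypothetical library logger

demo.setLevel(logging.WARNING)
demo.info("dropped: below the logger level")

demo.setLevel(logging.INFO)
demo.info("kept: passes logger and handler levels")

# Detach and close the handler so the file is flushed.
root.removeHandler(handler)
handler.close()

with open(log_path) as f:
    lines = f.read().splitlines()
print(lines)
```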

In [None]:
from sklearn.model_selection import train_test_split

random_state = 42
train_data, valid_data = train_test_split(train_data, test_size=0.1, random_state=random_state)

Collect garbage to reduce memory usage

In [None]:
import gc

gc.collect()

In [None]:
from autogluon.tabular import TabularPredictor


# use only Gradient Boosting and Random Forest to reduce execution time
hyperparameters = {
    'GBM': [
        {'extra_trees': True, 'seed': random_state, 'ag_args': {'name_suffix': 'XT'}},
        {}
    ],
    'RF': [
        {'criterion': 'gini', 'random_state': random_state, 'max_features': 'log2', 'n_estimators': 500,
         'ag_args': {'name_suffix': 'Gini_Log2', 'problem_types': ['binary']},
         'ag_args_fit': {'use_daal': True}},
        {'criterion': 'gini', 'random_state': random_state, 'max_features': 'sqrt', 'n_estimators': 500,
         'ag_args': {'name_suffix': 'Gini_Sqrt', 'problem_types': ['binary']},
         'ag_args_fit': {'use_daal': True}},
        {'criterion': 'gini', 'random_state': random_state, 'max_features': (train_data.shape[1] - 1) // 8, 'n_estimators': 500,
         'ag_args': {'name_suffix': 'Gini_Div8', 'problem_types': ['binary']},
         'ag_args_fit': {'use_daal': True}}
    ]
}

autogluon_predictor = TabularPredictor(
    label=label,
    eval_metric="roc_auc",
    learner_kwargs={'ignored_columns': [id_column]}
).fit(
    train_data=train_data,
    hyperparameters=hyperparameters,
    verbosity=2
)
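
The three Random Forest configurations differ only in how many features are considered at each split. As a rough illustration, the sketch below computes those counts for a hypothetical `n_features = 285`, assuming that `'sqrt'` and `'log2'` are truncated to integers as in stock Scikit-learn; the helper `features_per_split` is made up for this example.

```python
import math


def features_per_split(max_features, n_features: int) -> int:
    """Number of candidate features per split for a given max_features setting."""
    if max_features == "sqrt":
        return max(1, int(math.sqrt(n_features)))
    if max_features == "log2":
        return max(1, int(math.log2(n_features)))
    return int(max_features)  # an integer is used as-is


n_features = 285  # hypothetical; in the notebook this is train_data.shape[1] - 1
settings = {
    "Gini_Log2": "log2",
    "Gini_Sqrt": "sqrt",
    "Gini_Div8": n_features // 8,
}
counts = {name: features_per_split(value, n_features) for name, value in settings.items()}
print(counts)
```

Fewer candidate features per split make individual trees cheaper and less correlated, while more candidates make each tree stronger, so trying several settings gives the ensemble some diversity.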

In [None]:
leaderboard = autogluon_predictor.leaderboard(valid_data)
leaderboard
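
The models are evaluated with ROC AUC, since `eval_metric="roc_auc"` was set. ROC AUC can be read as the probability that a randomly chosen positive example is scored above a randomly chosen negative one. A minimal pure-Python sketch of that definition follows (quadratic in the number of rows, so for illustration only; the function name `pairwise_auc` is hypothetical):

```python
def pairwise_auc(y_true, y_score) -> float:
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


auc = pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc)  # 0.75
```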

In [None]:
predictions = autogluon_predictor.predict_proba(test_data)
submission.target = predictions.iloc[:, 1]
submission[:5]
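
The cell above takes the second column of the `predict_proba` output as the probability of the positive class. Assigning a Series to a DataFrame column aligns on the index, so the rows are matched by `id` rather than by position. A small sketch with made-up ids and probabilities:

```python
import pandas as pd

# Made-up class probabilities, indexed by id in a different order than the submission.
predictions = pd.DataFrame(
    {0: [0.9, 0.2, 0.6], 1: [0.1, 0.8, 0.4]},
    index=pd.Index([12, 10, 11], name="id"),
)
submission = pd.DataFrame(
    {"target": [0.5, 0.5, 0.5]},
    index=pd.Index([10, 11, 12], name="id"),
)

# Alignment happens on the index labels, not on row position.
submission["target"] = predictions.iloc[:, 1]
print(submission)
```

If an id were missing from `predictions`, the corresponding `target` would become `NaN`, so a quick `submission.target.isna().any()` check before saving can be worthwhile.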

In [None]:
submission.to_csv("submission.csv")

In [None]:
logger.removeHandler(fh)

In [None]:
!rm -rf AutogluonModels

### List of algorithms which are accelerated by sklearnex

In [None]:
!cat log.txt | grep 'running accelerated version' | sort | uniq
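
The same filtering can be done in Python if shell commands are unavailable. A sketch, assuming a log file in which accelerated calls are marked with the phrase `running accelerated version`; the sample lines written below are invented for the demonstration:

```python
import os
import tempfile

# Invented sample log; a real run would read the notebook's log.txt instead.
log_path = os.path.join(tempfile.mkdtemp(), "log.txt")
with open(log_path, "w") as f:
    f.write(
        "estimator_b: running accelerated version\n"
        "estimator_a: running accelerated version\n"
        "estimator_b: running accelerated version\n"
        "estimator_c: fallback to original implementation\n"
    )

# Keep matching lines, drop duplicates with a set, then sort (grep | sort | uniq).
with open(log_path) as f:
    accelerated = sorted({line.strip() for line in f if "running accelerated version" in line})

for line in accelerated:
    print(line)
```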

# Conclusions

Intel® Extension for Scikit-learn gives you opportunities to:

* Use your Scikit-learn code for training and inference without modification.
* Speed up your kernel.
* Get predictions of quality at least as good as the other tested frameworks.

*Please upvote if you liked it.*

# Other notebooks with sklearnex usage

### [[predict sales] Stacking with scikit-learn-intelex](https://www.kaggle.com/alexeykolobyanin/predict-sales-stacking-with-scikit-learn-intelex)

### [[TPS-Aug] NuSVR with Intel Extension for Sklearn](https://www.kaggle.com/alexeykolobyanin/tps-aug-nusvr-with-intel-extension-for-sklearn)

### [Using scikit-learn-intelex for What's Cooking](https://www.kaggle.com/kppetrov/using-scikit-learn-intelex-for-what-s-cooking?scriptVersionId=58739642)

### [Fast KNN using scikit-learn-intelex for MNIST](https://www.kaggle.com/kppetrov/fast-knn-using-scikit-learn-intelex-for-mnist?scriptVersionId=58738635)

### [Fast SVC using scikit-learn-intelex for MNIST](https://www.kaggle.com/kppetrov/fast-svc-using-scikit-learn-intelex-for-mnist?scriptVersionId=58739300)

### [Fast SVC using scikit-learn-intelex for NLP](https://www.kaggle.com/kppetrov/fast-svc-using-scikit-learn-intelex-for-nlp?scriptVersionId=58739339)