# AutoGluon with Intel® Extension for Scikit-learn* - Kaggle Tabular Playground Series - June 2021

This example shows how Intel(R) Extension for Scikit-learn improves AutoGluon performance in one of the Kaggle Playground competitions ([Kaggle Tabular Playground Series - June 2021](https://www.kaggle.com/c/tabular-playground-series-jun-2021/overview)). With minor changes, the same approach should apply to other classification competitions.

[Intel(R) Extension for Scikit-learn](https://github.com/intel/scikit-learn-intelex) enables optimized ML kernels from [Intel(R) oneAPI Data Analytics Library](https://github.com/oneapi-src/oneDAL) with just two lines of code in common cases, and with no code changes for AutoGluon, which patches scikit-learn automatically when the extension is installed.
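
For code that uses scikit-learn directly rather than through AutoGluon, the two lines are typically an import and a call to `patch_sklearn` from the `sklearnex` package. The sketch below assumes the extension is installed and falls back to stock scikit-learn otherwise; the `extension_enabled` flag is a hypothetical name used only for illustration.

```python
# Sketch: enable Intel(R) Extension for Scikit-learn manually (outside AutoGluon).
# Assumption: the scikit-learn-intelex package is installed. If it is not,
# the code falls back to stock scikit-learn instead of failing.
try:
    from sklearnex import patch_sklearn  # line 1 of the "two lines"

    patch_sklearn()  # line 2: replaces supported estimators with optimized ones
    extension_enabled = True
except ImportError:
    extension_enabled = False

print("Intel(R) Extension for Scikit-learn enabled:", extension_enabled)
```

Patching should happen before scikit-learn estimators are imported, so that those imports resolve to the optimized implementations.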

**AutoGluon installation:**

In [None]:
!pip install autogluon.tabular[all] -q --progress-bar off

**Intel® Extension for Scikit-learn installation:**

In [None]:
!pip install scikit-learn-intelex -q --progress-bar off

In [None]:
import numpy as np
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split
from timeit import default_timer as timer
import warnings
warnings.simplefilter("ignore", category=DeprecationWarning)

### Loading and analysis of data

In [None]:
competition_prefix = 'tabular-playground-series-jun-2021'

train_data = pd.read_csv(f'../input/{competition_prefix}/train.csv', index_col='id')
test_data = pd.read_csv(f'../input/{competition_prefix}/test.csv', index_col='id')
sample_submission = pd.read_csv(f'../input/{competition_prefix}/sample_submission.csv', index_col='id')

random_state = 42
train_data, valid_data = train_test_split(train_data, test_size=0.2, random_state=random_state)

print('Train data shape:', train_data.shape)
print('Valid data shape:', valid_data.shape)
print('Test data shape:', test_data.shape)

In [None]:
train_data.head(10)

In [None]:
train_data.describe()

In [None]:
label = 'target'
nuniques = train_data.drop([label], axis=1).nunique()
print('Maximum number of feature unique values:', nuniques.max())
print('Minimum number of feature unique values:', nuniques.min())

In [None]:
plt.figure(figsize=(8, 6))
sb.countplot(data=train_data, x=label, order=train_data[label].value_counts().index)
train_data[label].value_counts()

## AutoGluon with Intel® Extension for Scikit-learn

Let's run AutoGluon with Intel® Extension for Scikit-learn installed, using the default parameters except for fixed random states and a changed number of neighbors for the kNN algorithm.

In [None]:
from autogluon.tabular import TabularPredictor

time_limit = 7200  # time limit in seconds (2 hours)

# copy and modify default parameters from "fit" method
# https://auto.gluon.ai/stable/api/autogluon.predictor.html#autogluon.tabular.TabularPredictor.fit
# to fix random states and change n_neighbors parameter for KNN
hyperparameters = {
    'NN': {},
    'GBM': [
        {'extra_trees': True, 'seed': random_state, 'ag_args': {'name_suffix': 'XT'}},
        {},
        'GBMLarge',
    ],
    'CAT': {'random_seed': random_state},
    'XGB': {'seed': random_state},
    'FASTAI': {},
    'RF': [
        {'criterion': 'gini', 'random_state': random_state,
         'ag_args': {'name_suffix': 'Gini', 'problem_types': ['binary', 'multiclass']}},
        {'criterion': 'entropy', 'random_state': random_state,
         'ag_args': {'name_suffix': 'Entr', 'problem_types': ['binary', 'multiclass']}},
        {'criterion': 'mse', 'random_state': random_state,
         'ag_args': {'name_suffix': 'MSE', 'problem_types': ['regression']}},
    ],
    'XT': [
        {'criterion': 'gini', 'random_state': random_state,
         'ag_args': {'name_suffix': 'Gini', 'problem_types': ['binary', 'multiclass']}},
        {'criterion': 'entropy', 'random_state': random_state,
         'ag_args': {'name_suffix': 'Entr', 'problem_types': ['binary', 'multiclass']}},
        {'criterion': 'mse', 'random_state': random_state,
         'ag_args': {'name_suffix': 'MSE', 'problem_types': ['regression']}},
    ],
    'KNN': [
        {'weights': 'uniform', 'n_neighbors': 1000, 'ag_args': {'name_suffix': 'Unif'}},
        {'weights': 'distance', 'n_neighbors': 1000, 'ag_args': {'name_suffix': 'Dist'}},
    ],
}

t0 = timer()
autogluon_predictor = TabularPredictor(
    label=label,
    eval_metric="log_loss",
    learner_kwargs={'ignored_columns': ['id']}
).fit(
    train_data=train_data,
    tuning_data=valid_data,
    time_limit=time_limit,
    verbosity=0,
    hyperparameters=hyperparameters
)
t1 = timer()
fitting_time = t1 - t0

leaderboard = autogluon_predictor.leaderboard()

Pay attention to the "Intel(R) Extension for Scikit-learn* enabled (https://github.com/intel/scikit-learn-intelex)" messages in the previous output: they indicate that the extension is active.

In [None]:
leaderboard

As the model leaderboard shows, gradient boosting works best for this task.
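
The leaderboard ranks models by the `log_loss` metric chosen in `eval_metric`. As a reminder of what that metric measures, here is a minimal dependency-free sketch of multiclass log loss. The function name `multiclass_log_loss` and the toy labels and probabilities are hypothetical and are not taken from the competition data.

```python
import math


def multiclass_log_loss(true_labels, predicted_probas, eps=1e-15):
    """Mean negative log-probability assigned to the true class."""
    total = 0.0
    for label, probas in zip(true_labels, predicted_probas):
        # Clip the probability to avoid log(0) for confident wrong predictions.
        p = min(max(probas[label], eps), 1.0)
        total -= math.log(p)
    return total / len(true_labels)


# Hypothetical toy data: 3 samples, probabilities keyed by class name.
y_true = ["Class_1", "Class_2", "Class_1"]
y_proba = [
    {"Class_1": 0.7, "Class_2": 0.3},
    {"Class_1": 0.4, "Class_2": 0.6},
    {"Class_1": 0.5, "Class_2": 0.5},
]

toy_loss = multiclass_log_loss(y_true, y_proba)
print("Toy log loss:", round(toy_loss, 4))
```

Lower log loss is better. Note that AutoGluon leaderboards usually show this metric negated so that higher scores are better.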

In [None]:
t0 = timer()
predictions = autogluon_predictor.predict_proba(test_data)
t1 = timer()
prediction_time = t1 - t0
# align column names and index with the sample submission format
predictions.columns = list(sample_submission.columns)
predictions.index = sample_submission.index
# write the submission file
predictions.to_csv('tps_jun_2021_autogluon_submission.csv')

predictions.head()

In [None]:
print('Fitting time[s]:', round(fitting_time, 3))
print('Prediction time[s]:', round(prediction_time, 3))

Intel(R) Extension for Scikit-learn gives a **1.45x speedup of the overall AutoGluon fitting time and a 4.8x speedup of k-Nearest Neighbors prediction**.

For the default AutoGluon result, see [another notebook](https://www.kaggle.com/alex97andreev/tps-jun-default-autogluon).
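
The speedup figures above are ratios of the time measured without the extension to the time measured with it. A minimal sketch of that arithmetic follows. The timings are hypothetical placeholders, not measurements from either notebook, and `speedup` is an illustrative helper.

```python
def speedup(baseline_seconds, accelerated_seconds):
    """Return how many times faster the accelerated run is than the baseline."""
    if accelerated_seconds <= 0:
        raise ValueError("accelerated_seconds must be positive")
    return baseline_seconds / accelerated_seconds


# Hypothetical timings in seconds, NOT measurements from this notebook.
baseline_fit_seconds = 300.0     # e.g. fitting time without the extension
accelerated_fit_seconds = 200.0  # e.g. fitting time with the extension

fit_speedup = speedup(baseline_fit_seconds, accelerated_fit_seconds)
print(f"Fitting speedup: {fit_speedup:.2f}x")
```

In this notebook, the accelerated values would be `fitting_time` and `prediction_time`, with the baseline values taken from the default AutoGluon run.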

In [None]:
!rm -rf AutogluonModels