<br>
<h1 style = "font-size:25px ; font-weight : bold; color : #020296; text-align: center; border-radius: 10px 15px;"> 🚀 Fast Stacking with Intel(R) Extension for Scikit-learn  </h1>
<br>

For classical machine learning we often reach for scikit-learn, one of the most popular Python libraries. We use it to fit models and search for optimal parameters, but those runs can take hours, if not days. Speeding them up is of interest to anyone who uses scikit-learn.

I want to show you how to get results faster without changing your code. To do this, we will use another Python library, **[Intel(R) Extension for Scikit-learn](https://github.com/intel/scikit-learn-intelex)**. It accelerates scikit-learn and does not require you to change code already written for scikit-learn.

I will show you how to speed up your kernel from **5 hours to 1.5 hours** without any code changes!

# 🔨 Installing Intel(R) Extension for Scikit-learn

Let's try Intel(R) Extension for Scikit-learn. First, install it with pip. The package is also available in conda; see https://github.com/intel/scikit-learn-intelex for details.

In [None]:
!pip install scikit-learn-intelex -q --progress-bar off

In [None]:
from sklearnex import patch_sklearn, unpatch_sklearn
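
Patching swaps supported scikit-learn estimators for accelerated versions, so it generally has to run before those estimators are imported. A minimal sketch of that ordering follows. The `try_patch` helper is hypothetical and exists only so the example falls back to stock scikit-learn when the extension is not installed.

```python
def try_patch():
    """Patch scikit-learn if the extension is installed; return True on success."""
    try:
        from sklearnex import patch_sklearn
    except ImportError:
        return False  # extension missing: stock scikit-learn is used instead
    patch_sklearn()
    return True


patched = try_patch()

# Import estimators only after patching so the accelerated classes are picked up.
from sklearn.decomposition import PCA

print("accelerated" if patched else "stock", PCA.__module__)
```

The rest of this notebook follows the same pattern: `patch_sklearn()` is called first, and the estimator imports come afterwards.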

In [None]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import time
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

# 📋 Reading data and splitting into training and validation sets

In [None]:
train = pd.read_csv('../input/tabular-playground-series-jun-2021/train.csv')
test = pd.read_csv('../input/tabular-playground-series-jun-2021/test.csv')
sample_submission = pd.read_csv('../input/tabular-playground-series-jun-2021/sample_submission.csv')

y_train = train['target']
x_train = train.drop(['id','target'], axis=1)
x_test = test.drop(['id'], axis=1)

from sklearn.model_selection import train_test_split
x_train_sub, x_val, y_train_sub, y_val = train_test_split(x_train, y_train, test_size=0.2, random_state=42)
print(x_train_sub.shape, x_val.shape)

# 📊 Data preprocessing

## One-hot encoding

In [None]:
from sklearn.preprocessing import OneHotEncoder
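# `sparse_output` replaces the `sparse` argument in scikit-learn >= 1.2; use `sparse=False` on older versions
encoder = OneHotEncoder(sparse_output=False).fit(pd.concat([x_train, x_test]))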
x_train_sub_onehot = encoder.transform(x_train_sub)
x_val_onehot = encoder.transform(x_val)
x_test_onehot = encoder.transform(x_test)
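
The encoder above is fitted on the concatenation of training and test features, so categories that appear only in the test set still get a column. A toy sketch of why that matters (the two data frames are made up for illustration):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

toy_train = pd.DataFrame({"feature_0": [0, 1, 1]})
toy_test = pd.DataFrame({"feature_0": [1, 2]})  # value 2 never occurs in train

enc_train_only = OneHotEncoder().fit(toy_train)
enc_both = OneHotEncoder().fit(pd.concat([toy_train, toy_test]))

n_cols_train_only = len(enc_train_only.categories_[0])  # categories 0 and 1
n_cols_both = len(enc_both.categories_[0])              # categories 0, 1 and 2

# With the default handle_unknown='error', the train-only encoder rejects value 2.
try:
    enc_train_only.transform(toy_test)
    unknown_rejected = False
except ValueError:
    unknown_rejected = True

# Default output is a sparse matrix; convert it to a dense array for printing.
encoded_test = enc_both.transform(toy_test).toarray()
print(n_cols_train_only, n_cols_both, unknown_rejected)
print(encoded_test)
```

An alternative, not used in this notebook, would be to fit on the training data only and pass `handle_unknown='ignore'`, which encodes unseen categories as all zeros.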

## PCA

In [None]:
patch_sklearn()

from sklearn.decomposition import PCA
pca_full = PCA(random_state=0).fit(x_train_sub_onehot)

In [None]:
plt.rcParams["figure.figsize"] = (15,6)

fig, ax = plt.subplots()
xi = np.arange(1, x_train_sub_onehot.shape[1] + 1, step=1)
y = np.cumsum(pca_full.explained_variance_ratio_)

plt.ylim(0.0,1.1)
plt.plot(xi, y, color='b')

plt.xlabel('Number of Components')

plt.title('Explained variance')

plt.axhline(y=0.95, color='r', linestyle='--')
plt.text(2250, 0.85, '95% cut-off threshold', color='red', fontsize=16)

ax.grid(axis='x')
plt.show()
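
The number of components that reaches a given variance threshold can also be computed rather than read off the plot. A minimal sketch on synthetic data follows. The random matrix and the names `X_demo`, `pca_demo` and `n_components_95` are stand-ins; in the notebook the same calculation would apply to `pca_full.explained_variance_ratio_`.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic stand-in: 200 samples, 20 features with decreasing variance.
X_demo = rng.normal(size=(200, 20)) * np.linspace(5.0, 0.1, 20)

pca_demo = PCA(random_state=0).fit(X_demo)
cumulative = np.cumsum(pca_demo.explained_variance_ratio_)

# First index where the cumulative ratio reaches the threshold, converted to a count.
threshold = 0.95
n_components_95 = int(np.searchsorted(cumulative, threshold) + 1)
print(n_components_95, cumulative[n_components_95 - 1])
```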

In [None]:
pca = PCA(n_components=650, random_state=0).fit(x_train_sub_onehot)
x_train_sub_pca = pca.transform(x_train_sub_onehot)
x_val_pca = pca.transform(x_val_onehot)
x_test_pca = pca.transform(x_test_onehot)

In [None]:
del x_train_sub_onehot
del x_val_onehot
del x_test_onehot

## Normalization

In [None]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler().fit(x_train_sub_pca)
x_train_sub_norm = scaler.transform(x_train_sub_pca)
x_val_norm = scaler.transform(x_val_pca)
x_test_norm = scaler.transform(x_test_pca)
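
The scaler is fitted on the training split only, so validation and test values are mapped with the training minimum and maximum and can fall slightly outside [0, 1]. A toy illustration with made-up numbers:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

train_demo = np.array([[0.0], [5.0], [10.0]])
val_demo = np.array([[2.5], [12.0]])  # 12.0 exceeds the training maximum

scaler_demo = MinMaxScaler().fit(train_demo)
train_scaled = scaler_demo.transform(train_demo)
val_scaled = scaler_demo.transform(val_demo)

print(train_scaled.ravel())  # training data spans exactly [0, 1]
print(val_scaled.ravel())    # unseen data can land above 1
```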

In [None]:
del x_train_sub_pca
del x_val_pca
del x_test_pca

# 🔍 Defining the model and its parameters

The model is a stacking classifier. Its base estimators are logistic regression, kNN, and random forest; the final estimator is a pipeline of QuantileTransformer followed by another logistic regression.

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.ensemble import StackingClassifier
from sklearn.preprocessing import QuantileTransformer

def get_stacking_classifier(C1=None,
                            n_neighbors=None,
                            n_estimators=None, min_samples_split=None, min_samples_leaf=None,
                            n_quantiles=None, C2=None):
    from sklearn.linear_model import LogisticRegression
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.neighbors import KNeighborsClassifier

    log_reg = LogisticRegression(C=C1)
    knn = KNeighborsClassifier(n_neighbors=n_neighbors)
    rf = RandomForestClassifier(n_estimators=n_estimators, min_samples_split=min_samples_split,
                                min_samples_leaf=min_samples_leaf, random_state=0)
    log_reg_quantile = Pipeline([
        ('quantile', QuantileTransformer(n_quantiles=n_quantiles, random_state=0)),
        ('logreg', LogisticRegression(C=C2))
    ])
    
    stacking_estimators = [
        ('log_reg', log_reg),
        ('knn', knn),
        ('rf', rf)
    ]
    
    return StackingClassifier(estimators=stacking_estimators, final_estimator=log_reg_quantile)
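
In a stacking classifier, the final estimator is trained on the cross-validated predictions of the base estimators rather than on the raw features. A small sketch on synthetic data shows the shape of those meta-features. The data set and hyperparameters are made up, and a plain logistic regression stands in for the quantile pipeline.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# Small synthetic 3-class problem standing in for the competition data.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, n_informative=5,
                                     n_classes=3, random_state=0)

stack_demo = StackingClassifier(
    estimators=[
        ('log_reg', LogisticRegression(max_iter=1000)),
        ('knn', KNeighborsClassifier(n_neighbors=5)),
        ('rf', RandomForestClassifier(n_estimators=20, random_state=0)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
)
stack_demo.fit(X_demo, y_demo)

# The final estimator sees one probability column per class per base model: 3 * 3 = 9.
meta_features = stack_demo.transform(X_demo)
proba = stack_demo.predict_proba(X_demo[:5])
print(meta_features.shape, proba.shape)
```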

# ⚙️ Best parameters
These parameters were found by a search over a parameter grid.

In [None]:
best_params = {
    'C1': 0.0031483991304676337,
    'n_neighbors': 23,
    'n_estimators': 448,
    'min_samples_split': 10,
    'min_samples_leaf': 6,
    'n_quantiles': 3,
    'C2': 0.9808974699196531,
}
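
The search itself is not shown in this notebook. One way such a search could be set up, assuming scikit-learn's `GridSearchCV` and its nested `<estimator name>__<parameter>` naming, is sketched below. The tiny grid, the reduced estimator list and the synthetic data are illustrative only, not the values actually searched.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X_demo, y_demo = make_classification(n_samples=200, n_features=8, n_informative=4,
                                     n_classes=3, random_state=0)

stack_demo = StackingClassifier(
    estimators=[('log_reg', LogisticRegression(max_iter=1000)),
                ('knn', KNeighborsClassifier())],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=3,
)

# Nested parameters are addressed as <estimator name>__<parameter>.
param_grid = {
    'log_reg__C': [0.01, 1.0],
    'knn__n_neighbors': [5, 15],
}
search = GridSearchCV(stack_demo, param_grid, scoring='neg_log_loss', cv=3)
search.fit(X_demo, y_demo)
print(search.best_params_, -search.best_score_)
```

Parameters of the final pipeline would be reached the same way, for example `final_estimator__logreg__C` for the pipeline defined in `get_stacking_classifier`.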

# 🚝 Fit model with Intel(R) Extension for Scikit-learn

In [None]:
patch_sklearn()

classifier = get_stacking_classifier(**best_params)
t0 = time.time()
classifier.fit(x_train_sub_norm, y_train_sub)
t1 = time.time()
y_pred = classifier.predict_proba(x_val_norm)
t2 = time.time()

In [None]:
from sklearn.metrics import log_loss
print(f'fit time: {t1-t0} sec')
print(f'predict_proba time: {t2-t1} sec')
print(f"Metric value: {log_loss(y_val, y_pred)}")

# 🚂 Fit model with original Scikit-learn

In [None]:
unpatch_sklearn()

classifier = get_stacking_classifier(**best_params)
t0 = time.time()
classifier.fit(x_train_sub_norm, y_train_sub)
t1 = time.time()
y_pred = classifier.predict_proba(x_val_norm)
t2 = time.time()

In [None]:
print(f'fit time: {t1-t0} sec')
print(f'predict_proba time: {t2-t1} sec')
print(f"Metric value: {log_loss(y_val, y_pred)}")
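
The patched and unpatched runs repeat the same measurement, so it could be wrapped in a small helper. `time_fit_predict` is a hypothetical name, and `time.perf_counter` is used here because it is a monotonic clock meant for measuring intervals. The toy model and data are only there to make the sketch runnable.

```python
import time

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss


def time_fit_predict(model, X_fit, y_fit, X_eval, y_eval):
    """Fit, predict probabilities, and return both timings plus the log loss."""
    t0 = time.perf_counter()
    model.fit(X_fit, y_fit)
    t1 = time.perf_counter()
    proba = model.predict_proba(X_eval)
    t2 = time.perf_counter()
    return {'fit_sec': t1 - t0,
            'predict_sec': t2 - t1,
            'log_loss': log_loss(y_eval, proba)}


X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)
result = time_fit_predict(LogisticRegression(max_iter=1000),
                          X_demo[:200], y_demo[:200], X_demo[200:], y_demo[200:])
print(result)
```

Calling such a helper once after `patch_sklearn()` and once after `unpatch_sklearn()` would give two result dictionaries whose `fit_sec` ratio is the speedup and whose `log_loss` values can be compared directly.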

# 🎯 Fit final model and submit result

In [None]:
patch_sklearn()

classifier = get_stacking_classifier(**best_params)
classifier.fit(np.vstack((x_train_sub_norm, x_val_norm)), pd.concat([y_train_sub, y_val]))
y_pred = classifier.predict_proba(x_test_norm)

sample_submission[['Class_1','Class_2', 'Class_3', 'Class_4','Class_5','Class_6', 'Class_7', 'Class_8', 'Class_9']] = y_pred
sample_submission.to_csv('submission.csv', index=False)
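
`predict_proba` returns its columns in the order of `classifier.classes_`, so the assignment above relies on that order matching the submission columns. A sketch of labelling the columns from `classes_` directly, with toy labels in place of the competition data:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(90, 4))
y_demo = np.array(['Class_1', 'Class_2', 'Class_3'] * 30)

clf_demo = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)
proba = clf_demo.predict_proba(X_demo[:3])

# Label the probability columns from classes_ instead of a hand-written list.
submission_demo = pd.DataFrame(proba, columns=clf_demo.classes_)
submission_demo.insert(0, 'id', [0, 1, 2])
print(submission_demo.columns.tolist())
```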

# 📜 Conclusions

With Intel(R) Extension for Scikit-learn patching you can:

- Use your scikit-learn code for training and inference without modification;
- Train scikit-learn models and run predictions faster, leaving more time for experiments;
- Get the same quality of predictions.

*Please upvote if you like it.*