# 🚀 Optimizing Kaggle kernels using Intel(R) Extension for Scikit-learn

For classical machine learning algorithms, we often use the most popular Python library, scikit-learn. We use it to fit models and search for optimal parameters, but fitting and tuning with scikit-learn can take hours, if not days. Speeding up this process is something anyone who uses scikit-learn would be interested in.

I want to show you how to get results faster without changing your code. To do this, we will use another Python library, [scikit-learn-intelex](https://github.com/intel/scikit-learn-intelex). It accelerates scikit-learn and does not require you to change code written for scikit-learn.

I will show you how to speed up your kernel **from 2h 26min to 6 minutes** without changing your code! This is a **25x** speedup.

This kernel is based on [[TPS 2021-04] Support Vector Machines](https://www.kaggle.com/ekozyreff/tps-2021-04-support-vector-machines) and uses the same code, with the addition of scikit-learn-intelex.

Speedup details:

| Case             | Original time | Patched time | Speedup | Original accuracy | Patched accuracy |
| :--------------- | :-----------: | :----------: | :-----: | :---------------: | :--------------: |
| SVM RBF Train    | 10min 2s      | 38.5 s       | x15.6   | 0.7614 - local    | 0.7614 - local   |
| SVM RBF Predict  | 4min 51s      | 9.56 s       | x30.4   | 0.79062 - PL      | 0.79062 - PL     |
| SVM RBF 10 folds | 2h 26min 43s  | 5min 49s     | x25.2   | 0.79078 - PL      | 0.79066 - PL     |

Note: the actual run time depends on the particular VM hardware provisioned for the kernel, so there may be noticeable fluctuation in the timings.

Note 2: we observe slightly lower accuracy for the 10-fold case and will be investigating this.
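
The speedup column is the original time divided by the patched time. A minimal sketch that reproduces the figures from the table above; `to_seconds` and `speedup` are hypothetical helpers written for this illustration, not part of any library:

```python
import re

def to_seconds(text):
    """Convert a duration such as '2h 26min 43s' or '38.5 s' to seconds."""
    units = {"h": 3600.0, "min": 60.0, "s": 1.0}
    total = 0.0
    # Each match is a number followed by one of the units h, min or s.
    for value, unit in re.findall(r"([\d.]+)\s*(h|min|s)", text):
        total += float(value) * units[unit]
    return total

def speedup(original, patched):
    """Ratio of original run time to patched run time."""
    return to_seconds(original) / to_seconds(patched)

# Timings copied from the table above.
timings = {
    "SVM RBF Train": ("10min 2s", "38.5 s"),
    "SVM RBF Predict": ("4min 51s", "9.56 s"),
    "SVM RBF 10 folds": ("2h 26min 43s", "5min 49s"),
}

for case, (original, patched) in timings.items():
    print(f"{case}: x{speedup(original, patched):.1f}")
```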


# Installing scikit-learn-intelex

The package is also available in conda; please refer to https://github.com/intel/scikit-learn-intelex for details.

In [None]:
!pip install scikit-learn-intelex --progress-bar off >> /tmp/pip_sklearnex.log

# Enable Intel(R) Extension for Scikit-learn
This is where the magic happens: patch scikit-learn before importing any of its estimators.

In [None]:
from sklearnex import patch_sklearn
patch_sklearn()
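
One way to check that patching took effect is to look at which module `SVC` comes from after the call. This is a minimal sketch, assuming that a patched class reports a module outside the `sklearn` package (the exact module name depends on the installed version and is not guaranteed here). `svc_origin` is a hypothetical helper, and the fallback to stock scikit-learn is only there so the sketch runs without the extension installed:

```python
import importlib.util

def svc_origin():
    """Return the name of the module that defines sklearn.svm.SVC."""
    # Patch only if scikit-learn-intelex is installed; otherwise fall back
    # to stock scikit-learn so the sketch still runs.
    if importlib.util.find_spec("sklearnex") is not None:
        from sklearnex import patch_sklearn
        patch_sklearn()
    # Import the estimator AFTER patching so the accelerated class is picked up.
    from sklearn.svm import SVC
    return SVC.__module__

print(svc_origin())
```

If the printed module name starts with `sklearn.`, the stock implementation is in use and no acceleration should be expected.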

# Original code below
Only the code relevant to the final k-fold block is kept.

In [None]:
import os
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt

from sklearn.svm import LinearSVC
from sklearn.svm import SVC
from sklearn.linear_model import SGDClassifier

from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score

from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler

from sklearn.model_selection import KFold
from sklearn.model_selection import train_test_split


In [None]:
train = pd.read_csv('../input/tabular-playground-series-apr-2021/train.csv', index_col='PassengerId')
test = pd.read_csv('../input/tabular-playground-series-apr-2021/test.csv', index_col='PassengerId')
submission = pd.read_csv('../input/tabular-playground-series-apr-2021/sample_submission.csv', index_col='PassengerId')

target = train.pop('Survived')

In [None]:
train.drop(['Name', 'Ticket', 'Cabin'], axis=1, inplace=True)
test.drop(['Name', 'Ticket', 'Cabin'], axis=1, inplace=True)

In [None]:
test_prepared = test.copy()
train_prepared = train.copy()

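# Fill missing values with statistics from the training set only.
# Assign the result back instead of calling fillna(inplace=True) on a column
# selection, which recent pandas versions warn about or silently ignore.
test_prepared['Age'] = test_prepared['Age'].fillna(train['Age'].median())
train_prepared['Age'] = train_prepared['Age'].fillna(train['Age'].median())

test_prepared['Fare'] = test_prepared['Fare'].fillna(train['Fare'].median())
train_prepared['Fare'] = train_prepared['Fare'].fillna(train['Fare'].median())

test_prepared['Embarked'] = test_prepared['Embarked'].fillna('S')
train_prepared['Embarked'] = train_prepared['Embarked'].fillna('S')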

In [None]:
for col in ['Pclass', 'Sex', 'Embarked']:
    le = LabelEncoder()
    le.fit(train_prepared[col])
    train_prepared[col] = le.transform(train_prepared[col])
    test_prepared[col] = le.transform(test_prepared[col])
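
For readers unfamiliar with `LabelEncoder`: it maps each distinct value to an integer code, assigned in sorted order of the values seen during `fit`. A small sketch on toy port codes (the values are illustrative, not taken from the competition data):

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(["S", "C", "Q", "S"])

# Classes are stored sorted; the code of a value is its index in classes_.
classes = list(le.classes_)
codes = le.transform(["S", "C", "Q"])

print(classes)
print(list(codes))

# A value that was not present during fit cannot be encoded.
try:
    le.transform(["X"])
except ValueError as err:
    print("unseen label:", err)
```

This is why the loop above fits on the training column and reuses the same encoder for the test column: both end up with identical codes, provided the test set contains no unseen values.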

In [None]:
train_prepared.head()

In [None]:
train_prepared_scaled = train_prepared.copy()
test_prepared_scaled = test_prepared.copy()

scaler = StandardScaler()
scaler.fit(train_prepared)
train_prepared_scaled = scaler.transform(train_prepared_scaled)
test_prepared_scaled = scaler.transform(test_prepared_scaled)

train_prepared_scaled = pd.DataFrame(train_prepared_scaled, columns=train_prepared.columns)
test_prepared_scaled = pd.DataFrame(test_prepared_scaled, columns=train_prepared.columns)
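
What the scaling step does can be sketched with plain NumPy: each column is shifted by the training mean and divided by the training standard deviation, and the same training statistics are applied to the test data. The numbers below are toy values for illustration only:

```python
import numpy as np

# Toy stand-ins for one numeric column of the train and test sets.
train_col = np.array([22.0, 38.0, 26.0, 35.0])
test_col = np.array([30.0, 54.0])

# Statistics come from the training data only.
mean = train_col.mean()
std = train_col.std()  # population std (ddof=0), as StandardScaler uses

train_scaled = (train_col - mean) / std
test_scaled = (test_col - mean) / std  # reuses the training mean and std

print(train_scaled.mean(), train_scaled.std())
print(test_scaled)
```

The scaled training column has mean 0 and standard deviation 1 (up to floating-point error); the test column generally does not, because it is scaled with the training statistics.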

In [None]:
X_train, X_valid, y_train, y_valid = train_test_split(train_prepared_scaled, target, test_size=0.1, random_state=0)

In [None]:
%%time
svc_kernel_rbf = SVC(kernel='rbf', random_state=0, C=0.01)
svc_kernel_rbf.fit(X_train, y_train)
y_pred = svc_kernel_rbf.predict(X_valid)
accuracy_score(y_valid, y_pred)

In [None]:
%%time
final_pred = svc_kernel_rbf.predict(test_prepared_scaled)

# Comparing to original RBF case

Achieved the same accuracy in local scoring: 0.7614

Achieved the same accuracy on the public leaderboard: **0.79062**

Original training time: 10min 2s

Original predict time: 4min 51s



In [None]:
submission['Survived'] = np.round(final_pred).astype(int)
submission.to_csv('svc_kernel_rbf.csv')

In [None]:
%%time
n_folds = 10
kf = KFold(n_splits=n_folds, shuffle=True, random_state=0)
y_pred = np.zeros(test.shape[0])

for fold, (train_index, valid_index) in enumerate(kf.split(train_prepared_scaled, target)):
    print("Running Fold {}".format(fold + 1))
    X_train, X_valid = pd.DataFrame(train_prepared_scaled.iloc[train_index]), pd.DataFrame(train_prepared_scaled.iloc[valid_index])
    y_train, y_valid = target.iloc[train_index], target.iloc[valid_index]
    svc_kernel_rbf = SVC(kernel='rbf', random_state=0, C=0.01)
    svc_kernel_rbf.fit(X_train, y_train)
    print("  Accuracy: {}".format(accuracy_score(y_valid, svc_kernel_rbf.predict(X_valid))))
    y_pred += svc_kernel_rbf.predict(test_prepared_scaled)

y_pred /= n_folds

print("")
print("Done!")

In [None]:
submission['Survived'] = np.round(y_pred).astype(int)
submission.to_csv('svc_kernel_rbf_10_folds.csv')
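
Averaging the 0/1 predictions of the folds and rounding the result amounts to a majority vote across folds. A minimal sketch with toy votes (not real model output) shows the mechanics, including what happens on a 5-5 tie: `np.round` rounds halves to the nearest even value, so 0.5 becomes 0.

```python
import numpy as np

n_folds = 10
# Toy setup: number of folds that predict class 1 for each of 3 test rows.
ones_per_row = np.array([7, 2, 5])

# Build a (n_folds, 3) matrix of 0/1 votes with those counts.
fold_preds = (np.arange(n_folds)[:, None] < ones_per_row).astype(int)

# Accumulate and average, mirroring the loop in the notebook.
y_pred = np.zeros(fold_preds.shape[1])
for preds in fold_preds:
    y_pred += preds
y_pred /= n_folds

labels = np.round(y_pred).astype(int)
print(y_pred)   # [0.7 0.2 0.5]
print(labels)   # [1 0 0] -> the 5-5 tie resolves to class 0
```

With an even number of folds, ties are therefore always resolved in favour of class 0; an odd number of folds would avoid ties altogether.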