# 🚀 Optimizing Kaggle kernels using Intel(R) Extension for Scikit-learn

For classical machine learning algorithms, we often use the most popular Python library, scikit-learn. We use it to fit models and search for optimal parameters, but fitting and tuning with scikit-learn can take hours, if not days. Speeding up this process is something anyone who uses scikit-learn would be interested in.

I want to show you how to get results faster without changing the code. To do this, we will use another Python library, [scikit-learn-intelex](https://github.com/intel/scikit-learn-intelex). It accelerates scikit-learn and does not require you to change code written for scikit-learn.

SVM is pretty slow in stock scikit-learn, but with Intel(R) Extension for Scikit-learn it is fast enough for regular iterative work.

This kernel is based on [[TPS 2021-04] Support Vector Machines](https://www.kaggle.com/ekozyreff/tps-2021-04-support-vector-machines) and on the feature preprocessing from [Ensemble-learning meta-classifier for stacking](https://www.kaggle.com/remekkinas/ensemble-learning-meta-classifier-for-stacking), the best-scoring public kernel at the moment.

# Installing scikit-learn-intelex

The package is also available in conda. For details, see https://github.com/intel/scikit-learn-intelex

In [None]:
!pip install scikit-learn-intelex --progress-bar off >> /tmp/pip_sklearnex.log

# Enable Intel(R) Extension for Scikit-learn
This is where the magic happens: we patch scikit-learn.

In [None]:
from sklearnex import patch_sklearn
patch_sklearn()
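
Order matters here: the patch should be applied before the scikit-learn estimators are imported, which is why this cell comes before the imports below. A minimal sketch of that order follows. The `try`/`except` fallback and the `patched` flag are my own additions for illustration, so that the snippet also runs where scikit-learn-intelex is not installed.

```python
# Sketch: patch first, import estimators afterwards.
try:
    from sklearnex import patch_sklearn  # provided by scikit-learn-intelex
    patch_sklearn()
    patched = True
except ImportError:
    # Assumption for this sketch only: fall back to stock scikit-learn.
    patched = False

# Imported after patching on purpose, so an accelerated SVC can be picked up.
from sklearn.svm import SVC

print("patched:", patched)
# Printing the module is a quick way to see which implementation is in use.
print("SVC comes from:", SVC.__module__)
```

If the printed module still points to stock scikit-learn after a successful patch, the estimator was probably imported too early; re-importing it after `patch_sklearn()` should help.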

# Original code below
The code below is kept as it was, mixed from the kernels mentioned above.

In [None]:
import os
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt

from sklearn.svm import LinearSVC
from sklearn.svm import SVC
from sklearn.linear_model import SGDClassifier

from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score

from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler

from sklearn.model_selection import KFold
from sklearn.model_selection import train_test_split

from sklearn.feature_extraction.text import TfidfVectorizer

import optuna


In [None]:
RANDOM_SEED = 2021
PROBAS = True
FOLDS = 5
N_ESTIMATORS = 1000

TARGET = 'Survived'

# Loading data

In [None]:
train = pd.read_csv('../input/tabular-playground-series-apr-2021/train.csv')
test = pd.read_csv('../input/tabular-playground-series-apr-2021/test.csv')
submission = pd.read_csv('../input/tabular-playground-series-apr-2021/sample_submission.csv', index_col='PassengerId')
# Pseudo-labels taken from BIZEN's great notebook: https://www.kaggle.com/hiro5299834/tps-apr-2021-pseudo-labeling-voting-ensemble
pseudo_labels = pd.read_csv("../input/tps-apr-2021-label/pseudo_label.csv")
test[TARGET] = pseudo_labels[TARGET]
all_df = pd.concat([train, test]).reset_index(drop=True)

target = train.pop('Survived')

# Feature engineering

In [None]:
# Age, fillna with the overall mean age
all_df['Age'] = all_df['Age'].fillna(all_df['Age'].mean())

# Cabin, fillna with 'X' and take first letter
all_df['Cabin'] = all_df['Cabin'].fillna('X').map(lambda x: x[0].strip())

# Ticket, fillna with 'X'; keep the first token of multi-part tickets, otherwise use 'X'
all_df['Ticket'] = all_df['Ticket'].fillna('X').map(lambda x:str(x).split()[0] if len(str(x).split()) > 1 else 'X')

# Fare, fillna with the median fare of the passenger's Pclass, then apply log1p
fare_map = all_df[['Fare', 'Pclass']].dropna().groupby('Pclass').median().to_dict()
all_df['Fare'] = all_df['Fare'].fillna(all_df['Pclass'].map(fare_map['Fare']))
all_df['Fare'] = np.log1p(all_df['Fare'])

# Embarked, fillna with 'X' value
all_df['Embarked'] = all_df['Embarked'].fillna('X')

# Name, take only surnames
all_df['Name'] = all_df['Name'].map(lambda x: x.split(',')[0])
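
The Fare step above is the least obvious one: missing fares are filled with the median fare of the same `Pclass`, and the result is log-transformed. Here is a small sketch of the same idea on a made-up table. The `toy` frame, its values and the `FareLog` column are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Made-up data: one missing fare in each passenger class.
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Fare": [80.0, 100.0, np.nan, 7.0, 9.0, np.nan],
})

# Median fare per class, computed from the known fares only.
fare_by_class = toy.groupby("Pclass")["Fare"].median()

# Map each row's class to its median and use it only where Fare is missing.
toy["Fare"] = toy["Fare"].fillna(toy["Pclass"].map(fare_by_class))

# log1p compresses the long right tail of the fare distribution.
toy["FareLog"] = np.log1p(toy["Fare"])
print(toy)
```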

In [None]:
all_df.head(5)

In [None]:
label_cols = ['Name', 'Ticket', 'Sex']
onehot_cols = ['Cabin', 'Embarked']
numerical_cols = ['Pclass', 'Age', 'SibSp', 'Parch', 'Fare']

In [None]:
def label_encoder(c):
    le = LabelEncoder()
    return le.fit_transform(c)

scaler = StandardScaler()

onehot_encoded_df = pd.get_dummies(all_df[onehot_cols])
label_encoded_df = all_df[label_cols].apply(label_encoder)
numerical_df = pd.DataFrame(scaler.fit_transform(all_df[numerical_cols]), columns=numerical_cols)
target_df = all_df[TARGET]

all_df = pd.concat([numerical_df, label_encoded_df, onehot_encoded_df, target_df], axis=1)

In [None]:
all_df.head(5)

In [None]:
all_df_scaled = all_df.drop([TARGET], axis = 1).copy()

scaler = StandardScaler()
scaler.fit(all_df.drop([TARGET], axis = 1))
all_df_scaled = scaler.transform(all_df_scaled)

all_df_scaled = pd.DataFrame(all_df_scaled, columns=all_df.drop([TARGET], axis = 1).columns)
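
Note that the numerical columns were already standardized in an earlier cell, so this second pass should leave them essentially unchanged and mainly affects the label-encoded and one-hot columns. A small sketch on synthetic data (not the competition data) illustrates that standardizing twice gives the same result as standardizing once:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic "numerical features" with non-zero mean and non-unit variance.
rng = np.random.default_rng(0)
raw = rng.normal(loc=30.0, scale=12.0, size=(100, 2))

once = StandardScaler().fit_transform(raw)    # first standardization
twice = StandardScaler().fit_transform(once)  # standardizing the result again

# The second pass sees mean ~0 and std ~1, so it changes nothing noticeable.
print("same after second pass:", np.allclose(once, twice))
```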

In [None]:
all_df_scaled.head(5)

In [None]:
X = all_df_scaled
y = all_df[TARGET]

print (f'X:{X.shape} y: {y.shape} \n')

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = RANDOM_SEED)
print (f'X_train:{X_train.shape} y_train: {y_train.shape}')
print (f'X_test:{X_test.shape} y_test: {y_test.shape}')

test = all_df_scaled[len(train):]
print (f'test:{test.shape}')

# Single SVM run
Running a single SVM prediction.

In [None]:
%%time
svc_kernel_rbf = SVC(kernel='rbf', random_state=0, C=0.779481782160288, gamma=0.10264575666119422 )
svc_kernel_rbf.fit(X_train, y_train)
y_pred = svc_kernel_rbf.predict(X_test)
accuracy_score(y_test, y_pred)

In [None]:
%%time
final_pred = svc_kernel_rbf.predict(test)

In [None]:
submission['Survived'] = np.round(final_pred).astype(int)
submission.to_csv('svc_kernel_rbf.csv')

# Hyperparameter selection and KFold
Now that SVM is pretty fast, we can run Optuna for the parameter search and use KFold for the final submission.
I am using a pretty small range for the Optuna search and only a few trials, as I have already validated it through many other runs.

In [None]:
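def objective(trial):
    # Search space: only C is tuned; gamma and kernel are single-choice values
    params = {
        'C': trial.suggest_float('C', 0.01, 0.1, log=True),  # sample C on a log scale
        'gamma': trial.suggest_categorical('gamma', ["auto"]),
        'kernel': trial.suggest_categorical("kernel", ["rbf"])
    }

    # Fit on the training split and return the hold-out accuracy, which Optuna maximizes
    svc = SVC(**params)
    svc.fit(X_train, y_train)
    return svc.score(X_test, y_test)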
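
The objective above scores each candidate on a single hold-out split. One possible alternative, which this kernel does not use, is to score by mean cross-validated accuracy, which is now affordable because each fit is fast. A rough sketch, assuming synthetic data and a hypothetical helper `cv_objective`; it loops over a few log-spaced values of `C` instead of calling Optuna so that it stays self-contained:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for the competition data (assumption for this sketch).
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

def cv_objective(C, X, y, n_splits=3):
    """Hypothetical helper: mean cross-validated accuracy of an RBF SVC."""
    model = SVC(C=C, gamma="auto", kernel="rbf")
    return cross_val_score(model, X, y, cv=n_splits, scoring="accuracy").mean()

# Four log-spaced candidates covering the same 0.01..0.1 range as the search.
candidates = np.logspace(-2, -1, num=4)
scores = [cv_objective(C, X_demo, y_demo) for C in candidates]

for C, score in zip(candidates, scores):
    print(f"C={C:.4f}  mean CV accuracy={score:.4f}")
```

Inside an Optuna objective, the same helper could be called with the sampled `C` and the real `X` and `y`.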

In [None]:
study = optuna.create_study(sampler=optuna.samplers.TPESampler(seed=123),
                            direction="maximize",
                            pruner=optuna.pruners.MedianPruner())
study.optimize(objective, n_trials=5, show_progress_bar=True)

In [None]:
print(f"Best Value from Optuna: {study.best_trial.value}")
print(f"Best Params from Optuna: {study.best_params}")

if study.best_trial.value >= 0.88515:
    best_value = study.best_params
else:
    best_value = {'C': 0.779481782160288, 'gamma': 0.10264575666119422, 'kernel': 'rbf'}
    print(f"Using precalculated best params instead: {best_value}")
    

In [None]:
%%time
n_folds = 20
kf = KFold(n_splits=n_folds, shuffle=True, random_state=0)
y_pred = np.zeros(test.shape[0])

for fold, (train_index, valid_index) in enumerate(kf.split(X, y)):
    print("Running Fold {}".format(fold + 1))
    X_train, X_valid = pd.DataFrame(X.iloc[train_index]), pd.DataFrame(X.iloc[valid_index])
    y_train, y_valid = y.iloc[train_index], y.iloc[valid_index]
    svc_kernel_rbf = SVC(**best_value)
    svc_kernel_rbf.fit(X_train, y_train)
    print("  Accuracy: {}".format(accuracy_score(y_valid, svc_kernel_rbf.predict(X_valid))))
    y_pred += svc_kernel_rbf.predict(test)

y_pred /= n_folds

print("")
print("Done!")
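
Since `SVC.predict` returns class labels (0 or 1), summing them over the folds and dividing by `n_folds` gives the share of fold models that voted for class 1. Rounding that share is therefore a majority vote. A tiny sketch with made-up fold predictions (the `fold_preds` array is an assumption for illustration):

```python
import numpy as np

# Rows are fold models, columns are test passengers (made-up 0/1 predictions).
fold_preds = np.array([
    [1, 0, 1, 0],
    [1, 0, 0, 1],
    [1, 0, 1, 0],
    [0, 0, 1, 1],
])

# Share of models voting for class 1, per passenger.
vote_share = fold_preds.mean(axis=0)

# np.round rounds halves to the nearest even number, so an exact 0.5 tie becomes 0.
final = np.round(vote_share).astype(int)

print("vote share:", vote_share)
print("final     :", final)
```

With an even number of folds, exact ties can occur, and with `np.round` they resolve to class 0.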

In [None]:
submission['Survived'] = np.round(y_pred).astype(int)
submission.to_csv(f'svc_kernel_rbf_{n_folds}_folds_optuna.csv')

# 📜 Conclusions

With scikit-learn-intelex patching you can:

* Use your scikit-learn code for training and inference without modification.
* Train and predict with scikit-learn models faster, leaving more time for experiments.
* Get the same quality of predictions.

*Please, upvote if you like.*