# 🚀 Optimizing Kaggle kernels using Intel(R) Extension for Scikit-learn*
For classical machine learning algorithms, we often reach for scikit-learn, the most popular Python library for the job. We use it to fit models and search for optimal parameters, but a scikit-learn run can take hours, if not days. Anyone who uses scikit-learn would welcome a way to speed this up.

I want to show you how to get results faster without changing the code. To do this, we will use another Python library, **[scikit-learn-intelex](https://github.com/intel/scikit-learn-intelex)**. It accelerates scikit-learn and does not require you to change code written for scikit-learn.

I will show you how to speed up your kernel from **45 minutes to 2 minutes** without changing your code!

# 📘 Problem Statement
We want to predict which Tweets are about real disasters and which ones are not.

Main steps in this kernel:

- Preprocessing data
- TF-IDF
- Search for optimal parameters for the SVC algorithm from **scikit-learn** using **optuna**
- Search for optimal parameters for the SVC algorithm from **scikit-learn-intelex** using **optuna**
- Fit final model and submit result

In [None]:
import pandas as pd
import numpy as np
import json
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

import optuna

In [None]:
train = pd.read_csv('/kaggle/input/nlp-getting-started/train.csv')
test = pd.read_csv('/kaggle/input/nlp-getting-started/test.csv')
train.head()

# 📋 Preprocessing data

In [None]:
%%time
def join_list(tab):
    return " ".join(tab)

def transform_keyword(word):
    return word.split('%20')

def transform_text(text):    
    text = re.sub(r'(&amp;|&gt;|&lt;)', " ", text)
    text = re.sub(r"\s+", " ", text)
    text = re.sub(r'\t', ' ', text)
    text = re.sub(r'\n', ' ', text)
    text = re.sub(r'https?://\S+|www\.\S+', ' ',text)
    text = re.sub(r'@\S{0,}', ' USER ', text)
    text = re.sub(r"\s+", " ", text)
    text = re.sub(r'\b(USER)( \1\b)+', r'\1', text)
    text = re.sub(r'([a-zA-Z])\1{1,}', r'\1\1', text)
    text = re.sub(r"htt\S{0,}", " ", text)
    text = re.sub(r"[^a-zA-Z\d\s]", " ", text)
    text = re.sub(r'^\d\S{0,}| \d\S{0,}| \d\S{0,}$', ' NUMBER ', text)
    text = re.sub(r"\s+", " ", text)
    text = re.sub(r'\b(NUMBER)( \1\b)+', r'\1', text)
    text = re.sub(r"[0-9]", " ", text)
    text = text.strip()
    text = re.sub(r' via\s{1,}USER$', ' ', text)
    text = re.sub(r"\s+", " ", text)
    text = text.strip()
    return text

# The text column already holds plain strings; only the keyword column is split and re-joined below.

train.keyword = train.keyword.fillna(" ")
test.keyword = test.keyword.fillna(" ")

train.keyword = train.keyword.apply(transform_keyword).apply(join_list)
test.keyword = test.keyword.apply(transform_keyword).apply(join_list)

train.text = train.keyword + " " + train.text
test.text = test.keyword + " " + test.text

train.text = train.text.apply(transform_text)
test.text = test.text.apply(transform_text)
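
To make the effect of these rules easier to see, here is a reduced sketch that applies a handful of the same regular expressions to a single made-up tweet. The helper names `split_keyword` and `normalize`, the sample text, and the `@someone` handle are illustrative assumptions, not part of the kernel; `transform_text` above applies more steps than this.

```python
import re

def split_keyword(word):
    # Keywords encode spaces as "%20", for example "forest%20fire"
    return " ".join(word.split("%20"))

def normalize(text):
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # drop URLs
    text = re.sub(r"@\S{0,}", " USER ", text)           # mask user mentions
    text = re.sub(r"([a-zA-Z])\1{1,}", r"\1\1", text)   # cap repeated letters at two
    text = re.sub(r"[^a-zA-Z\d\s]", " ", text)          # replace punctuation with spaces
    text = re.sub(r"\s+", " ", text)                    # collapse whitespace
    return text.strip()

sample = "Sooooo much smoke!!! @someone http://t.co/abc"
print(split_keyword("forest%20fire"))  # forest fire
print(normalize(sample))               # Soo much smoke USER
```

Masking mentions and capping repeated letters shrinks the vocabulary that TF-IDF has to handle, which is the point of doing this before vectorization.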

In [None]:
x_train = train.text
x_test = test.text
y_train = train.target

# 🔍 TF-IDF

In [None]:
%%time
tfv = TfidfVectorizer(strip_accents='unicode', analyzer='word', token_pattern=r'\w{1,}',
            ngram_range=(1, 3), use_idf=True, smooth_idf=True, sublinear_tf=True, stop_words='english')

tfv.fit(pd.concat([x_train, x_test]))
xtrain_tfv = tfv.transform(x_train)
xtest_tfv = tfv.transform(x_test)

xtrain_tfv.shape, xtest_tfv.shape
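
For intuition about what the vectorizer computes, the following is a minimal pure-Python sketch of TF-IDF weighting with the same three switches turned on (IDF weighting, smoothed IDF, sublinear term frequency), followed by L2 normalization, which is the `TfidfVectorizer` default. It assumes already tokenized unigrams and ignores stop words, accent stripping and the 1 to 3 n-gram range; the `tfidf` function and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tfidf(docs):
    """Sublinear, smoothed, L2-normalized TF-IDF for tokenized documents."""
    n = len(docs)
    # Document frequency: in how many documents each term occurs
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n) / (1 + d)) + 1 for t, d in df.items()}
    rows = []
    for doc in docs:
        counts = Counter(doc)
        # Sublinear TF: 1 + ln(count) instead of the raw count
        w = {t: (1 + math.log(c)) * idf[t] for t, c in counts.items()}
        norm = math.sqrt(sum(v * v for v in w.values())) or 1.0
        rows.append({t: v / norm for t, v in w.items()})
    return rows

docs = [["fire", "fire", "forest"], ["forest", "walk"]]
weights = tfidf(docs)
print(weights[0])
```

In the first toy document, `fire` ends up with a larger weight than `forest`: it occurs twice there and appears in only one of the two documents.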

In [None]:
x_train_sub, x_val, y_train_sub, y_val = train_test_split(xtrain_tfv, y_train, random_state = 42, test_size=0.20)

# ⏳ Search for optimal parameters for the SVC algorithm from scikit-learn using optuna

In [None]:
def objective(trial):
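    # Imported inside the function so that every call picks up the SVC class
    # that is current at that moment (stock or patched).
    from sklearn.svm import SVC
    params = {
        # Sampled on a log scale (suggest_loguniform is deprecated in recent Optuna releases)
        'C': trial.suggest_float('C', 1e-4, 1e4, log=True),
        'gamma': trial.suggest_float('gamma', 1e-4, 1e4, log=True),
        'kernel': trial.suggest_categorical("kernel", ["linear", "rbf"])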
    }

    svc = SVC(**params)
    svc.fit(x_train_sub, y_train_sub)
    return svc.score(x_val, y_val)

In [None]:
study = optuna.create_study(sampler=optuna.samplers.TPESampler(seed=123),
                            direction="maximize",
                            pruner=optuna.pruners.MedianPruner())
study.optimize(objective, n_trials=50, show_progress_bar=True)
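
`C` and `gamma` are drawn on a logarithmic scale because their range spans eight orders of magnitude. The sketch below uses plain random sampling, not Optuna's TPE sampler, and only illustrates the distribution: about half of the draws fall below 1, whereas a uniform draw over the same interval would almost never do so. `sample_loguniform` is a hypothetical helper.

```python
import math
import random

def sample_loguniform(rng, low, high):
    # Draw uniformly in log space, then map back to the original scale
    return math.exp(rng.uniform(math.log(low), math.log(high)))

rng = random.Random(123)
samples = [sample_loguniform(rng, 1e-4, 1e4) for _ in range(10_000)]

# Share of draws that land below 1 (the midpoint of the range in log space)
below_one = sum(s < 1 for s in samples) / len(samples)
print(f"share below 1: {below_one:.2f}")
```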

In [None]:
print(f"Best Value: {study.best_trial.value}")
print(f"Best Params: {study.best_params}")

The search for optimal parameters for the SVM model took almost **45 minutes**.
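
If you want to record such timings in a variable rather than read them off the cell output, one option is a small context manager around the call. This is a sketch with a hypothetical `timed` helper and a stand-in workload; in the kernel, the body of the `with` block would be the `study.optimize(...)` call.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label, results):
    """Store the wall-clock seconds spent inside the block under results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start

timings = {}
with timed("stand-in", timings):
    sum(range(100_000))  # stand-in workload, not the real parameter search

print(f"{timings['stand-in']:.4f} s")
```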

# ⚡ Search for optimal parameters for the SVC algorithm from scikit-learn-intelex using optuna

Let's try scikit-learn-intelex. First, install it:

In [None]:
!pip install scikit-learn-intelex --progress-bar off >> /tmp/pip_sklearnex.log

To enable the optimizations, patch scikit-learn with Intel(R) Extension for Scikit-learn:

In [None]:
from sklearnex import patch_sklearn
patch_sklearn()
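
The order matters here: `patch_sklearn()` has to run before the estimator is imported, which is why `objective` imports `SVC` inside the function body. The sketch below, with a hypothetical `patched_svc_module` helper, shows one way to check which module provides `SVC` after patching. It returns `None` when scikit-learn-intelex is not installed, and the exact module name it reports depends on the installed version.

```python
def patched_svc_module():
    """Return the module that provides SVC after patching, or None if unavailable."""
    try:
        from sklearnex import patch_sklearn, unpatch_sklearn
    except ImportError:
        return None  # scikit-learn-intelex is not installed
    patch_sklearn()
    try:
        from sklearn.svm import SVC  # imported after patching on purpose
        return SVC.__module__
    finally:
        unpatch_sklearn()  # restore stock scikit-learn for this demonstration

print(patched_svc_module())
```

In the kernel itself there is no need to unpatch, because the rest of the notebook is meant to run with the accelerated estimators.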

In [None]:
study = optuna.create_study(sampler=optuna.samplers.TPESampler(seed=123),
                            direction="maximize",
                            pruner=optuna.pruners.MedianPruner())
study.optimize(objective, n_trials=50, show_progress_bar=True)

This time, the search for optimal parameters took a **little over two minutes**, saving us **more than 40 minutes**. Let’s make sure that the quality has not changed!

In [None]:
print(f"Best Value: {study.best_trial.value}")
print(f"Best Params: {study.best_params}")
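
As a rough check of both claims, the sketch below computes the speedup from the rounded timings quoted in the text and defines a tolerance-based comparison for two validation accuracies. The helper names and the default tolerance are assumptions. Because the kernel reuses the `study` variable, the first run's `study.best_trial.value` would have to be saved before the second search in order to compare the two.

```python
import math

def speedup(baseline_minutes, accelerated_minutes):
    """How many times faster the accelerated run is."""
    return baseline_minutes / accelerated_minutes

def same_quality(score_a, score_b, tolerance=1e-3):
    """True when two validation accuracies differ by at most `tolerance`."""
    return math.isclose(score_a, score_b, rel_tol=0.0, abs_tol=tolerance)

# Rounded figures from the text: roughly 45 minutes versus roughly 2 minutes
print(f"{speedup(45, 2):.1f}x")  # 22.5x
```

Since the accelerated run took a little over two minutes, the real factor is somewhat below 22.5.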

# 💡 Fit final model and submit result

In [None]:
from sklearn.svm import SVC
best_svc = SVC(**study.best_params)
best_svc.fit(xtrain_tfv, y_train)
y_pred_test = best_svc.predict(xtest_tfv)

In [None]:
sub = pd.read_csv('/kaggle/input/nlp-getting-started/sample_submission.csv')
sub["target"] = y_pred_test
sub.to_csv("submission_scikit-learn-intelex.csv",index=False)

# 📜 Conclusions
With scikit-learn-intelex patching you can:

- Use your scikit-learn code for training and inference without modification.
- Train and predict scikit-learn models **up to 25 times faster**.
- Get the same quality of predictions as other tested frameworks.

*Please upvote if you like it.*