<div style="background-color:rgba(0, 167, 255, 0.6);border-radius:5px;display:fill">
    <h1><center>Tabular Playground Series - Dec 2021</center></h1>
</div>

<center><a><img src="https://i.ibb.co/PWvpT9F/header.png" alt="header" border="0" width=800 height=400 class="center"></a></center>

For classical machine learning, Scikit-learn is the most popular Python library. It lets you fit models and search for optimal parameters, but on large datasets these steps can run for hours, so anyone who uses Scikit-learn has an interest in speeding them up.

I want to show you how to keep using Scikit-learn and get results faster. To do this, we will use another Python library, [**Intel® Extension for Scikit-learn***](https://github.com/intel/scikit-learn-intelex). It accelerates Scikit-learn and does not require any changes to code written for Scikit-learn.

I will show you how to **speed up** your kernel without changing your code!

<div style="background-color:rgba(0, 167, 255, 0.6);border-radius:5px;display:fill">
    <h1><center>Importing Libraries and Data</center></h1>
</div>

### Import Libraries

In [1]:
import pandas as pd
import numpy as np
from IPython.display import HTML
import warnings
warnings.filterwarnings("ignore")

import matplotlib.pyplot as plt

### Reading Data

In [2]:
PATH_TRAIN      = '../input/tabular-playground-series-dec-2021/train.csv'
PATH_TEST       = '../input/tabular-playground-series-dec-2021/test.csv'
PATH_SUBMISSION = '../input/tabular-playground-series-dec-2021/sample_submission.csv'

In [3]:
train_data = pd.read_csv(PATH_TRAIN)
test_data  = pd.read_csv(PATH_TEST)
submission = pd.read_csv(PATH_SUBMISSION)

### Reduce DataFrame memory usage

The data is quite large for the RAM of a Kaggle notebook instance, so we reduce memory usage by downcasting each column to a smaller data type.

In [4]:
def reduce_memory_usage(df):
    """Downcast each column to the smallest dtype that can hold its value range."""
    for col in df.columns:
        col_type = df[col].dtype

        if col_type != 'object':
            c_min = df[col].min()
            c_max = df[col].max()

            if str(col_type)[:3] == 'int':
                # Inclusive bounds: a value equal to the dtype limit still fits
                if c_min >= np.iinfo(np.int8).min and c_max <= np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min >= np.iinfo(np.int16).min and c_max <= np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min >= np.iinfo(np.int32).min and c_max <= np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                # Anything wider keeps its current integer dtype

            else:
                # float16 keeps only about 3 significant decimal digits
                if c_min >= np.finfo(np.float16).min and c_max <= np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min >= np.finfo(np.float32).min and c_max <= np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
        else:
            # Text columns are stored as categories
            df[col] = df[col].astype('category')

    return df

In [5]:
train_data = reduce_memory_usage(train_data)
test_data  = reduce_memory_usage(test_data)
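To see how much downcasting can save, here is a minimal sketch on a toy frame. The frame, its column names, and its sizes are hypothetical stand-ins for the competition data, and it uses pandas' built-in `pd.to_numeric(..., downcast="integer")` rather than the `reduce_memory_usage` function above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the competition data
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "small": rng.integers(0, 2, size=1000),      # values 0..1, fits in int8
    "medium": rng.integers(0, 4000, size=1000),  # values 0..3999, fits in int16
}).astype(np.int64)

# Memory footprint in bytes before downcasting
before = toy.memory_usage(deep=True).sum()

# Downcast every column to the smallest integer dtype that holds its values
downcast = toy.apply(pd.to_numeric, downcast="integer")
after = downcast.memory_usage(deep=True).sum()

print(downcast.dtypes)
print(f"{before} bytes -> {after} bytes")
```

The same before/after comparison with `memory_usage(deep=True).sum()` can be run on `train_data` to check the effect on the real data.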

In [6]:
train_data = train_data.drop(['Id', 'Soil_Type7', 'Soil_Type15'], axis = 1)
test_data = test_data.drop(['Id', 'Soil_Type7', 'Soil_Type15'], axis = 1)

In [7]:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4000000 entries, 0 to 3999999
Data columns (total 53 columns):
 #   Column                              Dtype
---  ------                              -----
 0   Elevation                           int16
 1   Aspect                              int16
 2   Slope                               int8 
 3   Horizontal_Distance_To_Hydrology    int16
 4   Vertical_Distance_To_Hydrology      int16
 5   Horizontal_Distance_To_Roadways     int16
 6   Hillshade_9am                       int16
 7   Hillshade_Noon                      int16
 8   Hillshade_3pm                       int16
 9   Horizontal_Distance_To_Fire_Points  int16
 10  Wilderness_Area1                    int8 
 11  Wilderness_Area2                    int8 
 12  Wilderness_Area3                    int8 
 13  Wilderness_Area4                    int8 
 14  Soil_Type1                          int8 
 15  Soil_Type2                          int8 
 16  Soil_Type3                          

Collect garbage to reduce memory usage

In [8]:
import gc

gc.collect()

96

### Intel® Extension for Scikit-learn installation:

In [9]:
!pip install scikit-learn-intelex -q --progress-bar off > /dev/null 2>&1

### Accelerate Scikit-learn with two lines of code:

In [10]:
from sklearnex import patch_sklearn
patch_sklearn()

Intel(R) Extension for Scikit-learn* enabled (https://github.com/intel/scikit-learn-intelex)


Set up logging to track which calls are accelerated:

In [11]:
import logging

logger = logging.getLogger()
fh     = logging.FileHandler('log.txt')

# Record everything from DEBUG (numeric level 10) upward in log.txt
fh.setLevel(logging.DEBUG)
logger.addHandler(fh)
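A message reaches the log file only if it passes both the logger's level and the handler's level. The sketch below illustrates this with the standard library alone; the logger name, the temporary file, and the messages are hypothetical and are not produced by the extension.

```python
import logging
import os
import tempfile

# Hypothetical log file in a temporary directory
log_path = os.path.join(tempfile.mkdtemp(), "demo_log.txt")

# Hypothetical logger name, kept separate from the root logger
demo_logger = logging.getLogger("demo_acceleration")
demo_logger.setLevel(logging.DEBUG)   # the logger lets everything through
demo_logger.propagate = False         # keep the demo out of other handlers

handler = logging.FileHandler(log_path)
handler.setLevel(logging.INFO)        # the handler drops DEBUG messages
demo_logger.addHandler(handler)

demo_logger.debug("dropped: below the handler level")
demo_logger.info("kept: running accelerated version on CPU")

# Flush and detach the handler before reading the file back
handler.close()
demo_logger.removeHandler(handler)

with open(log_path) as f:
    logged_lines = f.read().splitlines()

print(logged_lines)
```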

<div style="background-color:rgba(0, 167, 255, 0.6);border-radius:5px;display:fill">
    <h1><center>Feature importance</center></h1>
</div>

One of the most basic questions we might ask of a model is: which features have the biggest impact on predictions?

This concept is called feature importance.

There are multiple ways to measure feature importance. In this kernel we use permutation importance, computed with the ELI5 library.

In [12]:
X, y = train_data.drop(['Cover_Type'], axis = 1), train_data['Cover_Type']

In [13]:
from sklearn.model_selection import train_test_split
from timeit import default_timer as timer

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size = 0.1, random_state = 42)

### ELI5

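ELI5 provides a way to compute feature importances for any black-box estimator by measuring how much the score decreases when a feature is not available. Permutation importance simulates an unavailable feature by shuffling its values in the validation set, which breaks the link between that feature and the target without retraining the model.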
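The idea can be sketched by hand with NumPy. This is an illustration of the technique, not ELI5's implementation: the toy data, the `toy_predict` stand-in model, and the helper names are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical toy data: the label depends only on feature 0
X_toy = rng.normal(size=(2000, 3))
y_toy = (X_toy[:, 0] > 0).astype(int)

def toy_predict(X):
    """Stand-in for a fitted model: thresholds feature 0."""
    return (X[:, 0] > 0).astype(int)

def accuracy(y_true, y_pred):
    return float(np.mean(y_true == y_pred))

def permutation_importance_sketch(predict, X, y, n_repeats=5):
    """Mean drop in accuracy after shuffling each column in turn."""
    baseline = accuracy(y, predict(X))
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            # Shuffle one column to break its link with the target
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            drops.append(baseline - accuracy(y, predict(X_perm)))
        importances[j] = np.mean(drops)
    return importances

importances = permutation_importance_sketch(toy_predict, X_toy, y_toy)
print(importances)
```

Only feature 0 receives a large importance here, because the stand-in model ignores the other two columns and shuffling them cannot change its predictions.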

In [14]:
import eli5
from eli5.sklearn import PermutationImportance
from timeit import default_timer as timer
from sklearn.ensemble import RandomForestClassifier

In [15]:
timeFirstI  = timer()
modelRF     = RandomForestClassifier(random_state = 42).fit(X_train, y_train)
perm        = PermutationImportance(modelRF, random_state = 42).fit(X_val, y_val)
timeSecondI = timer()

In [16]:
print("Total time with Intel Extension: {} seconds".format(timeSecondI - timeFirstI))

Total time with Intel Extension: 4236.698096521 seconds


In [17]:
eli5.show_weights(perm, feature_names = X.columns.tolist())

| Weight | Feature |
|---|---|
| 0.4542 ± 0.0012 | Elevation |
| 0.0481 ± 0.0006 | Horizontal_Distance_To_Roadways |
| 0.0326 ± 0.0004 | Horizontal_Distance_To_Fire_Points |
| 0.0247 ± 0.0006 | Wilderness_Area3 |
| 0.0199 ± 0.0001 | Wilderness_Area4 |
| 0.0181 ± 0.0004 | Vertical_Distance_To_Hydrology |
| 0.0108 ± 0.0003 | Horizontal_Distance_To_Hydrology |
| 0.0087 ± 0.0003 | Soil_Type39 |
| 0.0075 ± 0.0001 | Soil_Type38 |
| 0.0064 ± 0.0002 | Wilderness_Area1 |


In [18]:
pi_features = eli5.explain_weights_df(perm, feature_names = X_train.columns.tolist())
pi_features = pi_features.loc[pi_features['weight'] >= 0.0001]['feature'].tolist()

In [19]:
pi_features[:5]

['Elevation',
 'Horizontal_Distance_To_Roadways',
 'Horizontal_Distance_To_Fire_Points',
 'Wilderness_Area3',
 'Wilderness_Area4']

In [20]:
X_trainPI = X_train.loc[:, pi_features]
X_valPI   = X_val.loc[:, pi_features]
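The threshold filter above can be illustrated on a small weights table. The table below is a hypothetical stand-in for the output of `eli5.explain_weights_df`, assuming only the `feature` and `weight` columns that the notebook itself uses; `Soil_Type_hypothetical` is an invented name.

```python
import pandas as pd

# Hypothetical weights table with the two columns the notebook relies on
weights = pd.DataFrame({
    "feature": ["Elevation", "Soil_Type39", "Soil_Type_hypothetical"],
    "weight": [0.4542, 0.0087, 0.00002],
})

# Keep features whose permutation weight is at least 0.0001
kept = weights.loc[weights["weight"] >= 0.0001, "feature"].tolist()

# Select the surviving columns from a toy feature frame
toy_X = pd.DataFrame({
    "Elevation": [3338, 3306],
    "Soil_Type39": [0, 0],
    "Soil_Type_hypothetical": [0, 1],
})
toy_X_selected = toy_X.loc[:, kept]

print(kept)
```

The selected frame keeps its columns in the order of the weights table, which is why `X_trainPI` lists the most important features first.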

In [21]:
X_trainPI[:5]

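|  | Elevation | Horizontal_Distance_To_Roadways | Horizontal_Distance_To_Fire_Points | Wilderness_Area3 | Wilderness_Area4 | Vertical_Distance_To_Hydrology | Horizontal_Distance_To_Hydrology | Soil_Type39 | Soil_Type38 | Wilderness_Area1 | ... | Soil_Type27 | Soil_Type34 | Soil_Type21 | Soil_Type28 | Soil_Type9 | Hillshade_Noon | Soil_Type29 | Wilderness_Area2 | Soil_Type4 | Soil_Type25 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1168622 | 3338 | 1486 | 1778 | 1 | 0 | 1 | 241 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 206 | 0 | 0 | 0 | 0 |
| 1528343 | 3306 | 1387 | 330 | 0 | 0 | 20 | 218 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 201 | 0 | 0 | 0 | 0 |
| 159785 | 2984 | 1860 | 2348 | 0 | 0 | 0 | 893 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 224 | 0 | 0 | 0 | 0 |
| 2933315 | 3239 | 3030 | 2065 | 0 | 0 | 117 | 239 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 209 | 0 | 0 | 0 | 0 |
| 430775 | 2992 | 2033 | 1180 | 1 | 0 | 183 | 537 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 246 | 0 | 0 | 0 | 0 |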


### Accelerated functions:

In [22]:
!cat log.txt | grep 'running accelerated version' | sort | uniq

sklearn.ensemble.RandomForestClassifier.fit: running accelerated version on CPU
sklearn.ensemble.RandomForestClassifier.predict: running accelerated version on CPU
sklearn.model_selection.train_test_split: running accelerated version on CPU
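Where shell commands are not available, the same `grep | sort | uniq` filtering can be done in Python. This is a minimal sketch on an in-memory sample: the first three lines are copied from the output above, while the last line is a hypothetical non-accelerated entry added only to show the filter at work.

```python
# Sample log text; the final line is a hypothetical non-accelerated entry
log_text = """\
sklearn.ensemble.RandomForestClassifier.predict: running accelerated version on CPU
sklearn.ensemble.RandomForestClassifier.fit: running accelerated version on CPU
sklearn.ensemble.RandomForestClassifier.predict: running accelerated version on CPU
some.hypothetical.function: not accelerated
"""

# A set removes duplicates (uniq); sorted() orders the result (sort)
accelerated = sorted(
    {line for line in log_text.splitlines() if "running accelerated version" in line}
)

for line in accelerated:
    print(line)
```

To run it on the real file, read `log.txt` with `open('log.txt').read()` in place of the sample string.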


### Default Scikit-learn

In [23]:
from sklearnex import unpatch_sklearn
unpatch_sklearn()

In [24]:
# Re-import after unpatching so the stock Scikit-learn estimator is used
import eli5
from eli5.sklearn import PermutationImportance
from timeit import default_timer as timer
from sklearn.ensemble import RandomForestClassifier

In [25]:
timeFirstD  = timer()
modelRF     = RandomForestClassifier(random_state = 42).fit(X_train, y_train)
perm        = PermutationImportance(modelRF, random_state = 42).fit(X_val, y_val)
timeSecondD = timer()

In [26]:
print("Total time with default Scikit-learn: {} seconds".format(timeSecondD - timeFirstD))

Total time with default Scikit-learn: 5964.087305736 seconds


In [27]:
eli5.show_weights(perm, feature_names = X.columns.tolist())

| Weight | Feature |
|---|---|
| 0.4536 ± 0.0013 | Elevation |
| 0.0468 ± 0.0006 | Horizontal_Distance_To_Roadways |
| 0.0311 ± 0.0004 | Horizontal_Distance_To_Fire_Points |
| 0.0229 ± 0.0006 | Wilderness_Area3 |
| 0.0188 ± 0.0001 | Wilderness_Area4 |
| 0.0173 ± 0.0003 | Vertical_Distance_To_Hydrology |
| 0.0100 ± 0.0002 | Horizontal_Distance_To_Hydrology |
| 0.0086 ± 0.0003 | Soil_Type39 |
| 0.0083 ± 0.0003 | Wilderness_Area1 |
| 0.0073 ± 0.0001 | Soil_Type38 |


In [28]:
eli5_speedup = round((timeSecondD - timeFirstD) / (timeSecondI - timeFirstI), 2)
HTML(f'<h2>ELI5 speedup: {eli5_speedup}x</h2>'
     f'(from {round((timeSecondD - timeFirstD), 2)} to {round((timeSecondI - timeFirstI), 2)} seconds)')
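As a worked check of the speedup formula, the two totals printed earlier in this notebook can be plugged in directly. The variable names below are illustrative; the timings are the ones reported by the two runs above.

```python
# Totals printed by the two timed runs above, in seconds
time_default = 5964.087305736   # default Scikit-learn
time_intel = 4236.698096521     # Intel Extension for Scikit-learn

# Speedup is the ratio of the slower time to the faster time
speedup = round(time_default / time_intel, 2)

# Absolute time saved, converted to minutes
saved_minutes = (time_default - time_intel) / 60

print(f"Speedup: {speedup}x, saved about {saved_minutes:.1f} minutes")
```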

<div style="background-color:rgba(0, 167, 255, 0.6);border-radius:5px;display:fill">
    <h1><center>CatBoost</center></h1>
</div>

In [29]:
test_data = test_data.loc[:, pi_features]

In [30]:
from catboost import CatBoostClassifier

cat_params = {
    'iterations': 20000,
    'depth': 7,
    'task_type' : 'GPU',
    'l2_leaf_reg': 5,
    'eval_metric': 'Accuracy',
}

cat = CatBoostClassifier(**cat_params)
cat.fit(X_trainPI, y_train, eval_set=(X_valPI, y_val))

0:	learn: 0.9029700	test: 0.9037500	best: 0.9037500 (0)	total: 52.9ms	remaining: 17m 37s
1:	learn: 0.9026928	test: 0.9036300	best: 0.9037500 (0)	total: 99.7ms	remaining: 16m 37s
2:	learn: 0.9049117	test: 0.9057725	best: 0.9057725 (2)	total: 147ms	remaining: 16m 16s
3:	learn: 0.9029819	test: 0.9038150	best: 0.9057725 (2)	total: 193ms	remaining: 16m 6s
4:	learn: 0.9052153	test: 0.9060800	best: 0.9060800 (4)	total: 246ms	remaining: 16m 22s
5:	learn: 0.9053919	test: 0.9062850	best: 0.9062850 (5)	total: 304ms	remaining: 16m 53s
6:	learn: 0.9055700	test: 0.9064450	best: 0.9064450 (6)	total: 350ms	remaining: 16m 39s
7:	learn: 0.9057889	test: 0.9066475	best: 0.9066475 (7)	total: 395ms	remaining: 16m 27s
8:	learn: 0.9058767	test: 0.9066775	best: 0.9066775 (8)	total: 442ms	remaining: 16m 22s
9:	learn: 0.9057908	test: 0.9066150	best: 0.9066775 (8)	total: 487ms	remaining: 16m 14s
10:	learn: 0.9057833	test: 0.9066050	best: 0.9066775 (8)	total: 534ms	remaining: 16m 10s
11:	learn: 0.9057778	test: 0.9

<catboost.core.CatBoostClassifier at 0x7fac320ea590>

In [31]:
predictions = cat.predict(test_data)
submission['Cover_Type'] = predictions
predictions[:5]

array([[2],
       [2],
       [2],
       [2],
       [2]])

In [32]:
submission.to_csv("submission.csv", index = False)

<div style="background-color:rgba(0, 167, 255, 0.6);border-radius:5px;display:fill">
    <h1><center>Conclusion</center></h1>
</div>

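**Intel® Extension for Scikit-learn** gives you opportunities to:
* Use your Scikit-learn code for training and inference without modification.
* Speed up your kernel: in this notebook, the random forest plus permutation importance step took about 4237 seconds with the extension versus about 5964 seconds with default Scikit-learn.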

*Please upvote if you liked it.*

<div style="background-color:rgba(0, 167, 255, 0.6);border-radius:5px;display:fill">
    <h1><center>Other notebooks with sklearnex usage</center></h1>
</div>

### [[predict sales] Stacking with scikit-learn-intelex](https://www.kaggle.com/alexeykolobyanin/predict-sales-stacking-with-scikit-learn-intelex)

### [[TPS-Aug] NuSVR with Intel Extension for Sklearn](https://www.kaggle.com/alexeykolobyanin/tps-aug-nusvr-with-intel-extension-for-sklearn)

### [Using scikit-learn-intelex for What's Cooking](https://www.kaggle.com/kppetrov/using-scikit-learn-intelex-for-what-s-cooking?scriptVersionId=58739642)

### [Fast KNN using scikit-learn-intelex for MNIST](https://www.kaggle.com/kppetrov/fast-knn-using-scikit-learn-intelex-for-mnist?scriptVersionId=58738635)

### [Fast SVC using scikit-learn-intelex for MNIST](https://www.kaggle.com/kppetrov/fast-svc-using-scikit-learn-intelex-for-mnist?scriptVersionId=58739300)

### [Fast SVC using scikit-learn-intelex for NLP](https://www.kaggle.com/kppetrov/fast-svc-using-scikit-learn-intelex-for-nlp?scriptVersionId=58739339)

### [Fast AutoML with Intel Extension for Scikit-learn](https://www.kaggle.com/lordozvlad/fast-automl-with-intel-extension-for-scikit-learn)

### [[Titanic] AutoML with Intel Extension for Sklearn](https://www.kaggle.com/lordozvlad/titanic-automl-with-intel-extension-for-sklearn)