# <p style="background-color:#ADD8E6; font-family:newtimeroman; font-size:180%; text-align:center"> Let's Speed Up! </p>

This notebook tackles the classification task from the "[Tabular Playground Series - Apr 2022](https://www.kaggle.com/c/tabular-playground-series-apr-2022)" competition.
* The first aim is to use Modin's pandas API (`modin.pandas`), which speeds up data loading compared to pandas.
* The second aim is fast modelling with Intel® Extension for Scikit-learn, which speeds up model training compared to stock scikit-learn.

## **Before jumping inside the code, I would sincerely request kagglers to upvote the notebook if you find this useful ;)**

## <p style="background-color:#ADD8E6; font-family:newtimeroman; font-size:120%; text-align:center">Table of Contents</p> <a href= '#Table of Contents'></a>

* [1. Introduction](#1)
* [2. Importing libraries](#2)
    * [2.1 Pandas vs Modin.Pandas](#2.1)
* [3. Data visualization 📊](#3)
* [4. Feature engineering](#4)
* [5. Modeling 🤖](#5)
* [6. Conclusions 📝](#6)
* [7. References](#7)

<a id='1'></a>
## <p style="background-color:#ADD8E6; font-family:newtimeroman; font-size:120%; text-align:center"> 1. Introduction</p> <a href= '#Introduction'></a>

The modin.pandas DataFrame is a lightweight, robust parallel DataFrame. Modin transparently distributes the data and the computation, so all you need to do is keep using the pandas API as you did before installing Modin. According to the Modin documentation, this lightweight design provides speed-ups of up to 4x on a laptop with 4 physical cores.

<center><a><img src="https://modin.readthedocs.io/en/stable/_static/MODIN_ver2.png" alt="header" border="0" width=300 height=200 class="center"></a>

**Intel® Extension for Scikit-learn**

Intel® Extension for Scikit-learn accelerates existing scikit-learn code while keeping full conformance with the scikit-learn APIs and algorithms. It is a free software AI accelerator that, according to Intel, delivers 10-100x speed-ups across a variety of applications. The acceleration is achieved through patching: the stock scikit-learn algorithms are replaced with the optimized versions provided by the extension.
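To make the idea of patching concrete, here is a toy sketch in plain Python. It is not how the extension is implemented internally; `toy_lib`, `patch_toy` and `unpatch_toy` are hypothetical names invented for this illustration. It only shows the general pattern: a function in a namespace is swapped for a faster one with the same signature, and references taken before the swap still point at the old function.

```python
import types

# Hypothetical stand-in for a library namespace; this is NOT scikit-learn.
toy_lib = types.SimpleNamespace()


def stock_sum_of_squares(values):
    """Stock implementation: explicit Python loop."""
    total = 0
    for v in values:
        total += v * v
    return total


def optimized_sum_of_squares(values):
    """Replacement with the same signature and the same result."""
    return sum(v * v for v in values)


toy_lib.sum_of_squares = stock_sum_of_squares
_saved = {}  # remembers the stock implementation so it can be restored


def patch_toy():
    """Swap in the optimized function (safe to call more than once)."""
    _saved.setdefault("sum_of_squares", toy_lib.sum_of_squares)
    toy_lib.sum_of_squares = optimized_sum_of_squares


def unpatch_toy():
    """Restore the stock function if a patch is active."""
    if "sum_of_squares" in _saved:
        toy_lib.sum_of_squares = _saved.pop("sum_of_squares")


bound_early = toy_lib.sum_of_squares   # reference taken BEFORE patching
patch_toy()
bound_late = toy_lib.sum_of_squares    # reference taken AFTER patching

print(bound_early is stock_sum_of_squares)      # True
print(bound_late is optimized_sum_of_squares)   # True
print(bound_early([1, 2, 3]) == bound_late([1, 2, 3]))  # True
```

The `bound_early` case is the reason the modelling section of this notebook calls `patch_sklearn()` before importing `RandomForestClassifier`, and imports it again after `unpatch_sklearn()`.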

<center><a><img src="https://miro.medium.com/max/1400/1*loqTWz8bcVAvVhmE1wUDVA.png" alt="header" border="0" width=300 height=200 class="center"></a>

<a id='2'></a>
## <p style="background-color:#ADD8E6; font-family:newtimeroman; font-size:120%; text-align:center"> 2. Importing Libraries</p> <a href= '#Importing Libraries'></a>

## Let's install and import all the necessary libraries

In [None]:
!pip install scikit-learn-intelex

In [None]:
!pip install "modin[ray]"

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib as mpl
import matplotlib.pyplot as plt
import warnings
import gc
from IPython.display import HTML
warnings.filterwarnings("ignore")

from timeit import default_timer as timer

<a id='2.1'></a>
## <p style="background-color:#ADD8E6; font-family:newtimeroman; font-size:100%; text-align:center"> 2.1 Pandas vs Modin.Pandas</p> <a href= '#Importing Libraries'></a>

## Reading Data using Pandas

In [None]:
PATH_TRAIN      = '../input/tabular-playground-series-apr-2022/train.csv'
PATH_TEST       = '../input/tabular-playground-series-apr-2022/test.csv'
PATH_LABELS     = '../input/tabular-playground-series-apr-2022/train_labels.csv'
PATH_SUBMISSION = '../input/tabular-playground-series-apr-2022/sample_submission.csv'

In [None]:
tPandasF = timer()
train = pd.read_csv(PATH_TRAIN)
test  = pd.read_csv(PATH_TEST)
train_labels = pd.read_csv(PATH_LABELS)
submission = pd.read_csv(PATH_SUBMISSION)
tPandasS = timer()

In [None]:
print("Data reading with default pandas time: {}".format(tPandasS - tPandasF))

## Reading Data using Modin Pandas

In [None]:
import modin.pandas as pd
import ray
ray.init()

In [None]:
tModinF = timer()
train = pd.read_csv(PATH_TRAIN)
test  = pd.read_csv(PATH_TEST)
train_labels = pd.read_csv(PATH_LABELS)
submission = pd.read_csv(PATH_SUBMISSION)
tModinS = timer()

In [None]:
print("Data reading with Modin time: {}".format(tModinS - tModinF))

In [None]:
modin_speedup = round((tPandasS - tPandasF) / (tModinS - tModinF), 2)
HTML(f'<h2>Reading data speedup: {modin_speedup}x</h2>'
     f'(from {round((tPandasS - tPandasF), 2)} to {round((tModinS - tModinF), 2)} seconds)')
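The start/stop timer pattern used in the cells above can be wrapped in two small helpers. This is only a sketch: `timed` and `speedup` are hypothetical names that do not appear elsewhere in this notebook. Keep in mind that the second read of the same files may also benefit from the operating system's file cache, so a single run gives a rough estimate rather than a rigorous benchmark.

```python
from timeit import default_timer as timer


def timed(func, *args, **kwargs):
    """Call func once and return (result, elapsed_seconds)."""
    start = timer()
    result = func(*args, **kwargs)
    return result, timer() - start


def speedup(baseline_seconds, optimized_seconds):
    """Ratio baseline / optimized, rounded to 2 decimals as in the cells above."""
    if optimized_seconds <= 0:
        raise ValueError("optimized_seconds must be positive")
    return round(baseline_seconds / optimized_seconds, 2)


# Toy usage with a built-in function instead of read_csv
total, seconds = timed(sum, range(1_000_000))
print(f"sum took {seconds:.4f} seconds")
print(speedup(6.0, 2.0))  # 3.0
```

With these helpers, a read could be timed as `train, t_pandas = timed(pd.read_csv, PATH_TRAIN)`, once with each library.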

<a id='3'></a>
## <p style="background-color:#ADD8E6; font-family:newtimeroman; font-size:120%; text-align:center"> 3. Data Visualization</p> <a href= '#Data Visualization'></a>

## Let's get some basic insights

In [None]:
display(train.shape)
display(test.shape)
display(train_labels.shape)
display(submission.shape)

In [None]:
train.head()

In [None]:
test.head()

In [None]:
train_labels.head()

In [None]:
train.describe().T.style.bar(subset=['mean'], color='yellow')\
                            .background_gradient(subset=['std'], cmap='Spectral')

In [None]:
test.describe().T.style.bar(subset=['mean'], color='yellow')\
                            .background_gradient(subset=['std'], cmap='Spectral_r')

In [None]:
def zerodata(zero_data):
    """Plot the percentage of zero values per feature as horizontal bars."""
    fig, ax = plt.subplots(1, 1, figsize=(12, 20))
    # Grey background bars mark the full 100% range
    ax.barh(zero_data.index, 100, color='grey', height=0.6)
    barh_label = ax.barh(zero_data.index, zero_data, color='lightblue', height=0.6)
    ax.bar_label(barh_label, fmt='%.01f %%', color='black')
    ax.spines[['left', 'bottom']].set_visible(False)
    ax.set_xticks([])
    ax.set_title('% of Zeros (by feature)', loc='center', fontweight='bold', fontsize=15)
    plt.show()

In [None]:
# Percentage of zeros per column, reversed so the first column is drawn at the top
zero_data_train = ((train == 0).sum() / len(train) * 100)[::-1]
zerodata(zero_data_train)

In [None]:
fig, ax = plt.subplots(figsize=(12 , 12))
corr = train.corr()
mask = np.zeros_like(corr, dtype=bool)  # np.bool was removed in NumPy 1.24
mask[np.triu_indices_from(mask)] = True
sns.heatmap(corr,
        square=True, center=0, linewidth=0.2,
        cmap=sns.diverging_palette(250, 20, as_cmap=True),
        mask=mask, ax=ax) 
ax.set_title('Feature Correlation', loc='left', fontweight='bold')
plt.show()

<a id='4'></a>
## <p style="background-color:#ADD8E6; font-family:newtimeroman; font-size:120%; text-align:center"> 4. Feature Engineering</p> <a href= '#Feature Engineering'></a>

In [None]:
sensor = [f"{i:02d}" for i in range(13)]  # '00' .. '12'

# Raw columns to drop once the per-sequence statistics have been added
drop_columns = [f"sensor_{i}" for i in sensor] + ["step"]

In [None]:
def feature_engineer(df):
    """Add per-(sequence, subject) statistics for every sensor, then keep one row per sequence."""
    keys = ['sequence', 'subject']
    stats = ['mean', 'std', 'skew', 'max', 'min']
    df_copy = df.copy()
    for i in sensor:
        grouped = df.groupby(keys)[f"sensor_{i}"]
        for stat in stats:
            # e.g. grouped.mean() renamed to "sensor_00_mean"
            stat_value = getattr(grouped, stat)().rename(f"sensor_{i}_{stat}")
            df_copy = df_copy.merge(stat_value, left_on=keys, right_index=True)

    df_copy = df_copy.drop(drop_columns, axis=1)
    # Assumes 60 rows (time steps) per sequence, so every 60th row is one sequence
    df_copy = df_copy[::60]
    return df_copy

In [None]:
train = feature_engineer(train)
test =  feature_engineer(test)
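The same per-sequence statistics can also be computed with a single `groupby(...).agg(...)` call, which returns one row per group directly instead of merging back and slicing every 60th row. The sketch below uses plain pandas on a tiny made-up frame; `aggregate_sensors` is a hypothetical helper, not the function used in this notebook. Note that its rows come out sorted by the group keys, which matches the `[::60]` approach only if the input is already ordered by sequence.

```python
import pandas as pd


def aggregate_sensors(df, sensor_cols, keys=("sequence", "subject")):
    """Return one row per group with mean/std/skew/max/min of each sensor column."""
    stats = ["mean", "std", "skew", "max", "min"]
    agg = df.groupby(list(keys))[sensor_cols].agg(stats)
    # Flatten the (column, stat) MultiIndex into names like "sensor_00_mean"
    agg.columns = [f"{col}_{stat}" for col, stat in agg.columns]
    return agg.reset_index()


# Made-up toy data: 2 sequences with 3 steps each (the real data has 60 steps)
toy = pd.DataFrame({
    "sequence":  [0, 0, 0, 1, 1, 1],
    "subject":   [7, 7, 7, 9, 9, 9],
    "step":      [0, 1, 2, 0, 1, 2],
    "sensor_00": [1.0, 2.0, 3.0, 4.0, 4.0, 10.0],
})

features = aggregate_sensors(toy, ["sensor_00"])
print(features)
```

The `step` column is never selected, so no separate drop is needed in this variant.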

<a id='5'></a>
## <p style="background-color:#ADD8E6; font-family:newtimeroman; font-size:120%; text-align:center"> 5. Modelling</p> <a href= '#Modelling'></a>

In [None]:
X_train = train.drop(["sequence", "subject"], axis=1).reset_index(drop=True)
y_train = train_labels["state"]  # 1-D target; avoids scikit-learn's column-vector warning
X_test  = test.drop(["sequence", "subject"], axis=1).reset_index(drop=True)

In [None]:
from sklearnex import patch_sklearn
patch_sklearn()

In [None]:
import logging

logger = logging.getLogger()
fh = logging.FileHandler('log.txt')
fh.setLevel(logging.DEBUG)  # same as the numeric level 10
logger.addHandler(fh)

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

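def bo_params_rf(max_samples, max_features):
    # Objective for Bayesian optimization: fits a forest and returns its accuracy.
    # Note: the score is computed on the training data itself, so it is optimistic.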
    params = {
        'max_samples' : max_samples,
        'max_features' : max_features,
    }
    
    clf = RandomForestClassifier(**params)
    clf.fit(X_train, y_train)
    
    score = accuracy_score(y_train, clf.predict(X_train))
    
    return score

In [None]:
from bayes_opt import BayesianOptimization
rf_bo = BayesianOptimization(bo_params_rf, {
                                             'max_samples': (0.5, 0.9),
                                             'max_features':(0.5, 0.9)
                                            })

In [None]:
results = rf_bo.maximize(n_iter = 2, init_points = 2, acq = 'ei')

## RandomForest with optimized Scikit-learn

In [None]:
params = rf_bo.max['params']

slfOpt = RandomForestClassifier(**params, n_estimators = 500, random_state = 42)

tFO = timer()
slfOpt.fit(X_train, y_train)
tSO = timer()

In [None]:
print("Total fitting Random Forest time with optimized Scikit-learn: {} seconds".format(tSO - tFO))

## RandomForest with default Scikit-learn

In [None]:
from sklearnex import unpatch_sklearn
unpatch_sklearn()

In [None]:
from sklearn.ensemble import RandomForestClassifier

params = rf_bo.max['params']

slf = RandomForestClassifier(**params, n_estimators = 500, random_state = 42)

tFD = timer()
slf.fit(X_train, y_train)
tSD = timer()

In [None]:
print("Total fitting Random Forest time with default Scikit-learn: {} seconds".format(tSD - tFD))

In [None]:
rf_speedup = round((tSD - tFD) / (tSO - tFO), 2)
HTML(f'<h2>RandomForest speedup: {rf_speedup}x</h2>'
     f'(from {round((tSD - tFD), 2)} to {round((tSO - tFO), 2)} seconds)')

## Prediction

In [None]:
predictions = slfOpt.predict(X_test)
submission['state'] = predictions
submission[:5]

In [None]:
submission.to_csv("submission.csv", index = False)

<a id='6'></a>
## <p style="background-color:#ADD8E6; font-family:newtimeroman; font-size:120%; text-align:center"> 6. Conclusion</p> <a href= '#Conclusion'></a>

In this notebook, Modin loaded the data about **3x** faster than pandas, and Intel® Extension for Scikit-learn made model fitting up to **12x** faster than the default scikit-learn.

<a id='7'></a>
## <p style="background-color:#ADD8E6; font-family:newtimeroman; font-size:120%; text-align:center"> 7. References</p> <a href= '#References'></a>

I would like to sincerely thank [Devlikamov Vlad](https://www.kaggle.com/lordozvlad/code) for his notebooks on Intel® Extension for Scikit-learn, which served as a reference for this notebook. I would also like to thank [Shoma Tateno](https://www.kaggle.com/shoooono) for the feature engineering technique.

## **If you find this notebook useful, kindly UPVOTE it; that would really encourage me ;)**