# Command Line Patching

![Assets/SVC_results.png](Assets/SVC_results.png)
### Use nbconvert and patch_sklearn from the command line

# Learning Objectives:

- Describe how to import and apply patch_sklearn()
- Describe how to import and apply unpatch_sklearn()
- Describe and apply a method for patching an entire Python program
- Describe how to surgically unpatch specific optimized functions if needed
- Describe a patching strategy that ensures that Intel Extension for Scikit-learn runs as fast as or faster than the stock algorithms it replaces
- Apply the patching methodology to speed up KNN on the CovType dataset

# Steps

You will convert a Jupyter Notebook to a Python file using jupyter nbconvert and then apply the patch via the command line. Then you will run the patched Python script to confirm that the patch has been applied.

* On your DevCloud instance:
1) Click the blue + in the upper left of the browser (the launcher)

![Assets/NewLauncher.jpg](Assets/NewLauncher.jpg)

2) Scroll down in the launcher and Launch a Terminal


![Assets/LaunchTerminal.jpg](Assets/LaunchTerminal.jpg)


- In the terminal:
1) Change directories to the current folder as follows:
   - cd "ai_learning_paths/ML using oneAPI/01_Introduction to Intel Extensions for Scikit-learn Patching/"
2) Convert the Jupyter Notebook to a Python script using jupyter nbconvert as follows:
   - **jupyter nbconvert --to script SampleSVM_Notebook.ipynb**
3) Run the Python code twice, first patched and then unpatched, as follows:
   - **python -m sklearnex SampleSVM_Notebook.py**
   - **python SampleSVM_Notebook.py**
   - The first run (**patched globally**) should take about **11 seconds**
   - The second run (**unpatched**) takes **1 to 2 minutes**
4) Compare the execution times of the two runs
5) Compare the accuracy metrics of both runs

- Was there a significant difference in time or accuracy?

The **python -m sklearnex** command applies the sklearnex patch to the entire Python file before executing it.
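The two runs can also be timed by a small driver script instead of by hand. The following is a minimal sketch, assuming it is run from the folder that contains SampleSVM_Notebook.py; the helper name `time_command` is illustrative and not part of sklearnex.

```python
import subprocess
import sys
import time
from pathlib import Path


def time_command(args):
    """Run a command to completion and return its wall-clock duration in seconds."""
    start = time.perf_counter()
    subprocess.run(args, check=True)
    return time.perf_counter() - start


script = Path("SampleSVM_Notebook.py")  # produced by the nbconvert step
if script.exists():
    # Same two commands as in the steps above, using the current interpreter.
    patched = time_command([sys.executable, "-m", "sklearnex", str(script)])
    stock = time_command([sys.executable, str(script)])
    print(f"patched: {patched:.1f} s, unpatched: {stock:.1f} s")
else:
    print(f"{script} not found; run the nbconvert step first")
```

The measured times include interpreter start-up and data loading, so they will be somewhat larger than the fit and predict times the script prints itself.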


# *Real World* example KNN on CovType Dataset

### Compare timings of stock KNN versus Intel Extension for Scikit-learn KNN using patch_sklearn()

Below we will apply Intel Extension for Scikit-learn to a use case on a CPU.

Intel® Extension for Scikit-learn contains drop-in replacement functionality for the stock scikit-learn package. You can take advantage of the performance optimizations of Intel Extension for Scikit-learn by adding just two lines of code before the usual scikit-learn imports. Intel® Extension for Scikit-learn patching affects performance of specific Scikit-learn functionality.
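As a hedged illustration of the "two lines of code" mentioned above, the sketch below applies the global patch only when the extension is installed. The wrapper `patch_if_available` and the `PATCHED` flag are hypothetical names introduced for this example; only `patch_sklearn` comes from sklearnex.

```python
import importlib.util


def patch_if_available():
    """Apply the global patch when sklearnex is installed; return True if patched."""
    if importlib.util.find_spec("sklearnex") is None:
        return False  # extension not installed: stock scikit-learn stays in use
    # These are the two lines the text refers to.
    from sklearnex import patch_sklearn
    patch_sklearn()
    return True


PATCHED = patch_if_available()
print("sklearnex patch applied:", PATCHED)

# Import estimators only after this point, for example:
# from sklearn.svm import SVC
```

Placing the patch before the scikit-learn imports matters because names imported earlier with `from sklearn... import ...` keep pointing at the stock implementations.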

### Data: covtype

We will use the forest cover type dataset, known as covtype, and fetch the data from sklearn.datasets.


Here we are **predicting forest cover type** from cartographic variables only (no remotely sensed data). The actual forest cover type for a given observation (30 x 30 meter cell) was determined from US Forest Service (USFS) Region 2 Resource Information System (RIS) data. Independent variables were derived from data originally obtained from US Geological Survey (USGS) and USFS data. Data is in raw form (not scaled) and contains binary (0 or 1) columns of data for qualitative independent variables (wilderness areas and soil types).

This study area includes four wilderness areas located in the Roosevelt National Forest of northern Colorado. These areas represent forests with minimal human-caused disturbances, so that existing forest cover types are more a result of ecological processes rather than forest management practices.



### Overview of procedure
In the example below we will train a kNN model on the covtype dataset with Intel Extension for Scikit-learn, run prediction, and measure the CPU and wall-clock time for training and prediction. In the next step we will unpatch Intel Extension for Scikit-learn and observe the time taken on the CPU for the same training and prediction.
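One way to capture both CPU time and wall-clock time around a training or prediction call is a small context manager. This is a sketch using only the Python standard library; the name `stopwatch` and the `timings` dictionary are illustrative, and the summation inside the `with` block is a stand-in for a real `fit` or `predict` call.

```python
import time
from contextlib import contextmanager


@contextmanager
def stopwatch(label, results):
    """Record wall-clock and CPU seconds spent inside the with-block."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    try:
        yield
    finally:
        results[label] = {
            "wall": time.perf_counter() - wall_start,
            "cpu": time.process_time() - cpu_start,
        }


timings = {}
with stopwatch("demo", timings):
    sum(i * i for i in range(100_000))  # stand-in for model.fit(X_train, y_train)
print(timings)
```

`time.process_time()` adds up CPU time across all threads of the process, so for multithreaded code the CPU figure can exceed the wall-clock figure.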

### Fetch the Data



In [3]:
from sklearn.preprocessing import OneHotEncoder
from sklearn import datasets
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
import pandas as pd
import time
import numpy as np
import numpy.ma as ma

# Load the Connect-4 data: 42 board-position columns plus one outcome column
connect4 = pd.read_csv('data/connect-4.data')

# Encode board cells as integers: x -> 0, o -> 1, b (blank) -> 2
data = connect4.iloc[:, :42].replace(['x', 'o', 'b'], [0, 1, 2])

keep = .25  # fraction of the data to use, to keep run times reasonable
subsetLen = int(keep * data.shape[0])

X = np.byte(data.iloc[:subsetLen, :].to_numpy())
enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(X)

XOHE = np.short(enc.transform(X).toarray())  # X one-hot encoded

Data_y = connect4.iloc[:, 42].to_numpy()
#np.random.shuffle(Data_y)
y = Data_y[:subsetLen]

from sklearnex import patch_sklearn, unpatch_sklearn
patch_sklearn('train_test_split')  #surgically patch train_test_split
import sklearn.model_selection as model_selection
X_train, X_test, y_train, y_test = model_selection.train_test_split(XOHE, y, train_size=0.80, test_size=0.20, random_state=101)

Intel(R) Extension for Scikit-learn* enabled (https://github.com/intel/scikit-learn-intelex)


In [9]:
from sklearn.metrics import classification_report

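def predict(linear):
    """Predict on X_test; return (elapsed seconds, predictions)."""
    time_patch_predict = time.time()
    y_pred = linear.predict(X_test)
    elapsed = time.time() - time_patch_predict
    return elapsed, y_pred

def fit():
    """Train a linear SVC on the training split; return (elapsed seconds, model)."""
    # svm is imported in the next cell, after patching has been applied or removed
    start = time.time()
    linear = svm.SVC(kernel='linear', C=100).fit(X_train, y_train)
    time_patch_fit = time.time() - start
    return time_patch_fit, linear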


In [10]:
from sklearn.metrics import classification_report

# Apply the patch_sklearn() function to this cell then run the cell and note the time:

###############################
## add patch here ##

patch_sklearn()

###############################
from sklearn import svm
time_fit, linear = fit()
time_predict, y_pred = predict(linear)
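# classification_report sorts string labels alphabetically (draw, loss, win),
# so target_names must follow that order. The report recorded below was produced
# with ['win', 'loss', 'draw'], so its 'win' and 'draw' row labels are swapped.
target_names = ['draw', 'loss', 'win']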
print("file as is ")
print(classification_report(y_test, y_pred, target_names=target_names))
print('Elapsed time: {:.2f} sec'.format( time_fit + time_predict))

Intel(R) Extension for Scikit-learn* enabled (https://github.com/intel/scikit-learn-intelex)


file as is 
              precision    recall  f1-score   support

         win       0.45      0.02      0.04       231
        loss       0.68      0.59      0.63       614
        draw       0.85      0.95      0.89      2533

    accuracy                           0.82      3378
   macro avg       0.66      0.52      0.52      3378
weighted avg       0.79      0.82      0.79      3378

Elapsed time: 11.21 sec


In [11]:
from sklearnex import patch_sklearn, unpatch_sklearn
from sklearn.metrics import classification_report

# UNPATCH ####

unpatch_sklearn("svc")

##############
from sklearn import svm
time_fit, linear = fit()
time_predict, y_pred = predict(linear)
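# classification_report sorts string labels alphabetically (draw, loss, win),
# so target_names must follow that order. The report recorded below was produced
# with ['win', 'loss', 'draw'], so its 'win' and 'draw' row labels are swapped.
target_names = ['draw', 'loss', 'win']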
print("explicit unpatch ")
print(classification_report(y_test, y_pred, target_names=target_names))
print('Elapsed time: {:.2f} sec'.format( time_fit + time_predict))

explicit unpatch 
              precision    recall  f1-score   support

         win       0.45      0.02      0.04       231
        loss       0.68      0.59      0.63       614
        draw       0.85      0.95      0.89      2533

    accuracy                           0.82      3378
   macro avg       0.66      0.52      0.52      3378
weighted avg       0.79      0.82      0.79      3378

Elapsed time: 76.36 sec


Compare the times and accuracies of these two runs.

Given the time and accuracy of each run, is patching this function worth the effort?
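The comparison can be made concrete with a little arithmetic on the two "Elapsed time" values printed above. The variable names below are illustrative; the numbers are copied from the recorded outputs.

```python
patched_seconds = 11.21    # "Elapsed time" printed by the patched cell
unpatched_seconds = 76.36  # "Elapsed time" printed by the unpatched cell

speedup = unpatched_seconds / patched_seconds
saved = unpatched_seconds - patched_seconds
print(f"speedup: {speedup:.1f}x, time saved: {saved:.2f} s")
```

The classification reports of the two runs are identical, so in this recorded run the speedup comes with no change in the reported metrics.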

As a reminder, here is how to find the list of functions available to patch:

## List the names of the functions available to patch

In [1]:
# return the list of optimized functions
from sklearnex import get_patch_names
get_patch_names()

['pca',
 'kmeans',
 'dbscan',
 'distances',
 'linear',
 'ridge',
 'elasticnet',
 'lasso',
 'logistic',
 'log_reg',
 'knn_classifier',
 'nearest_neighbors',
 'knn_regressor',
 'random_forest_classifier',
 'random_forest_regressor',
 'train_test_split',
 'fin_check',
 'roc_auc_score',
 'tsne',
 'logisticregression',
 'kneighborsclassifier',
 'nearestneighbors',
 'kneighborsregressor',
 'randomrorestclassifier',
 'randomforestregressor',
 'svr',
 'svc',
 'nusvr',
 'nusvc',
 'set_config',
 'get_config',
 'config_context']
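Before passing names to patch_sklearn() or unpatch_sklearn(), it can help to check them against this list. The sketch below is plain Python with a hypothetical helper, `split_patch_names`; it assumes names are matched case-insensitively and uses only a few names copied from the output above rather than calling sklearnex.

```python
def split_patch_names(requested, available):
    """Split requested names into (recognized, unrecognized), ignoring case."""
    known = {name.lower() for name in available}
    recognized = [n for n in requested if n.lower() in known]
    unrecognized = [n for n in requested if n.lower() not in known]
    return recognized, unrecognized


# A few names copied from the get_patch_names() output above.
available = ["pca", "kmeans", "distances", "train_test_split", "svc"]

ok, bad = split_patch_names(["SVC", "kmeans", "svm"], available)
print("recognized:", ok)
print("unrecognized:", bad)
```

In a live session, `available` would come from `get_patch_names()` itself, and the recognized names could then be passed to `patch_sklearn()` or `unpatch_sklearn()`.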

## Use get_patch_map for more information

get_patch_map() shows, for each patch name, which scikit-learn object is replaced and what replaces it. This helps when choosing specific names to patch for surgical control.

In [2]:
from sklearnex import get_patch_names, get_patch_map
get_patch_map()

{'pca': [[(<module 'sklearn.decomposition' from '/home/u78349/.local/lib/python3.9/site-packages/sklearn/decomposition/__init__.py'>,
    'PCA',
    daal4py.sklearn.decomposition._pca.PCA),
   None]],
 'kmeans': [[(<module 'sklearn.cluster' from '/home/u78349/.local/lib/python3.9/site-packages/sklearn/cluster/__init__.py'>,
    'KMeans',
    daal4py.sklearn.cluster._k_means_0_23.KMeans),
   None]],
 'dbscan': [[(<module 'sklearn.cluster' from '/home/u78349/.local/lib/python3.9/site-packages/sklearn/cluster/__init__.py'>,
    'DBSCAN',
    daal4py.sklearn.cluster._dbscan.DBSCAN),
   None]],
 'distances': [[(<module 'sklearn.metrics' from '/home/u78349/.local/lib/python3.9/site-packages/sklearn/metrics/__init__.py'>,
    'pairwise_distances',
    <function daal4py.sklearn.metrics._pairwise.daal_pairwise_distances(X, Y=None, metric='euclidean', n_jobs=None, force_all_finite=True, **kwds)>),
   None]],
 'linear': [[(<module 'sklearn.linear_model' from '/home/u78349/.local/lib/python3.9/sit

# Summary:

You have:

1) Applied patching globally, by region in a cell, and surgically.
2) Turned patching off.
3) Become equipped to use any combination of patching strategies to control the patching behavior of a given function.

Quiz:

1) In lecture we learned that the sklearnex pairwise_distances only accepts the metrics 'cosine' and 'correlation'. Assume you were calling pairwise_distances as follows:
   - pairwise_distances(X, y)
2) What is the default metric used?
3) If you REQUIRED Euclidean distance, suggest a patching strategy that avoids having to change the call to pairwise_distances while still getting the benefit of patching globally for the rest of the notebook.
    

# Notices & Disclaimers 

Intel technologies may require enabled hardware, software or service activation.
No product or component can be absolutely secure.

Your costs and results may vary.

© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. 
*Other names and brands may be claimed as the property of others.