<a href="https://colab.research.google.com/github/ShaunakSen/ML_Deployment/blob/master/Benchmarking_XGBoost_with_GPU_and_HummingBird.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Benchmarking XGBoost on GPU

> Based on the YouTube tutorial by AIEngineering: https://www.youtube.com/watch?v=X7cKC6GgyxY

---



In [2]:
import xgboost as xgb
xgb.__version__

'0.90'

In [0]:
from __future__ import print_function
import sys, tempfile, urllib, os
import pandas as pd
from sklearn.model_selection import train_test_split
import numpy as np

Import a dataset called "covertype"

In [0]:
from sklearn.datasets import fetch_openml

covtyp = fetch_openml(name="covertype", version=4)

In [5]:
print (covtyp.DESCR)

**Author**: Jock A. Blackard, Dr. Denis J. Dean, Dr. Charles W. Anderson  
**Source**: [UCI](https://archive.ics.uci.edu/ml/datasets/Covertype) - 1998  

This is the original version of the famous covertype dataset in ARFF format. 

**Covertype**  
Predicting forest cover type from cartographic variables only (no remotely sensed data). The actual forest cover type for a given observation (30 x 30 meter cell) was determined from US Forest Service (USFS) Region 2 Resource Information System &#40;RIS&#41; data. Independent variables were derived from data originally obtained from US Geological Survey (USGS) and USFS data. Data is in raw form (not scaled) and contains binary (0 or 1) columns of data for qualitative independent variables (wilderness areas and soil types). 

This study area includes four wilderness areas located in the Roosevelt National Forest of northern Colorado. These areas represent forests with minimal human-caused disturbances, so that existing forest cover types are 

In [6]:
print (covtyp.feature_names, len(covtyp.feature_names))

['Elevation', 'Aspect', 'Slope', 'Horizontal_Distance_To_Hydrology', 'Vertical_Distance_To_Hydrology', 'Horizontal_Distance_To_Roadways', 'Hillshade_9am', 'Hillshade_Noon', 'Hillshade_3pm', 'Horizontal_Distance_To_Fire_Points', 'Wilderness_Area1', 'Wilderness_Area2', 'Wilderness_Area3', 'Wilderness_Area4', 'Soil_Type1', 'Soil_Type2', 'Soil_Type3', 'Soil_Type4', 'Soil_Type5', 'Soil_Type6', 'Soil_Type7', 'Soil_Type8', 'Soil_Type9', 'Soil_Type10', 'Soil_Type11', 'Soil_Type12', 'Soil_Type13', 'Soil_Type14', 'Soil_Type15', 'Soil_Type16', 'Soil_Type17', 'Soil_Type18', 'Soil_Type19', 'Soil_Type20', 'Soil_Type21', 'Soil_Type22', 'Soil_Type23', 'Soil_Type24', 'Soil_Type25', 'Soil_Type26', 'Soil_Type27', 'Soil_Type28', 'Soil_Type29', 'Soil_Type30', 'Soil_Type31', 'Soil_Type32', 'Soil_Type33', 'Soil_Type34', 'Soil_Type35', 'Soil_Type36', 'Soil_Type37', 'Soil_Type38', 'Soil_Type39', 'Soil_Type40'] 54


In [7]:
covtyp.data.shape

(581012, 54)

In [8]:
covtyp.target.shape

(581012,)

In [9]:
print ("Unique classes:", np.unique(covtyp.target))

Unique classes: ['1' '2' '3' '4' '5' '6' '7']


As we can see, the dataset is fairly large: 581,012 rows and 54 features.

It is a **multi-class classification problem**

In [10]:
!nvidia-smi

Sat Jun 13 16:21:56 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.36.06    Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   38C    P0    29W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                 ERR! |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

Ok, so we have a Tesla P100 GPU with 16 GB memory

Let us convert the data to a pandas DataFrame

In [0]:
cov_df = pd.DataFrame(data=np.c_[covtyp.data, covtyp.target], columns=covtyp.feature_names+['target'])

Let's look at the memory usage of the index plus all the columns.

`cov_df.memory_usage(index=True)` returns the memory usage of the index and of each column in bytes, so we sum the values to get the approximate total memory usage of the DataFrame.

In [17]:
cov_df.memory_usage(index=True).sum()

255645408

The memory usage of the DataFrame is roughly **256 MB** (255,645,408 bytes)
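One caveat about this number: for `object` columns, `memory_usage` counts only the 8-byte references by default, not the Python objects they point to. The total above equals 581012 × 55 × 8 bytes plus 128 bytes for the index, which is consistent with that shallow count. The sketch below uses a small made-up frame (the column names `as_object` and `as_float` are hypothetical) to show the difference between the default and `deep=True`:

```python
import pandas as pd

# Toy frame with hypothetical data: the same values stored as Python strings
# (object dtype) and as float64.
toy = pd.DataFrame({
    "as_object": pd.Series(["2596", "2590", "2804"], dtype=object),
    "as_float": pd.Series([2596.0, 2590.0, 2804.0]),
})

# Default (shallow): object columns are counted as one reference per row.
shallow = toy.memory_usage(index=True)

# deep=True also counts the memory of the Python objects themselves.
deep = toy.memory_usage(index=True, deep=True)

print(shallow)
print(deep)
```

For the float column both calls report the same size; for the object column the deep figure is larger.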

In [18]:
cov_df.head()

Unnamed: 0,Elevation,Aspect,Slope,Horizontal_Distance_To_Hydrology,Vertical_Distance_To_Hydrology,Horizontal_Distance_To_Roadways,Hillshade_9am,Hillshade_Noon,Hillshade_3pm,Horizontal_Distance_To_Fire_Points,Wilderness_Area1,Wilderness_Area2,Wilderness_Area3,Wilderness_Area4,Soil_Type1,Soil_Type2,Soil_Type3,Soil_Type4,Soil_Type5,Soil_Type6,Soil_Type7,Soil_Type8,Soil_Type9,Soil_Type10,Soil_Type11,Soil_Type12,Soil_Type13,Soil_Type14,Soil_Type15,Soil_Type16,Soil_Type17,Soil_Type18,Soil_Type19,Soil_Type20,Soil_Type21,Soil_Type22,Soil_Type23,Soil_Type24,Soil_Type25,Soil_Type26,Soil_Type27,Soil_Type28,Soil_Type29,Soil_Type30,Soil_Type31,Soil_Type32,Soil_Type33,Soil_Type34,Soil_Type35,Soil_Type36,Soil_Type37,Soil_Type38,Soil_Type39,Soil_Type40,target
0,2596,51,3,258,0,510,221,232,148,6279,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,5
1,2590,56,2,212,-6,390,220,235,151,6225,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,5
2,2804,139,9,268,65,3180,234,238,135,6121,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2
3,2785,155,18,242,118,3090,238,238,122,6211,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,2
4,2595,45,2,153,-1,391,220,234,150,6172,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,5


In [20]:
print (cov_df.shape, covtyp.data.shape, covtyp.target.shape)

(581012, 55) (581012, 54) (581012,)


How is the target distributed?

In [21]:
cov_df['target'].value_counts()

2    283301
1    211840
3     35754
7     20510
6     17367
5      9493
4      2747
Name: target, dtype: int64
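To make the imbalance easier to read, the counts above can be turned into shares. This is a small sketch in plain Python; the counts are copied from the output above, and the variable names are only illustrative:

```python
# Class counts copied from the value_counts() output above.
counts = {2: 283301, 1: 211840, 3: 35754, 7: 20510, 6: 17367, 5: 9493, 4: 2747}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# Print classes from most to least frequent.
for label, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"class {label}: {share:.2%}")

# Ratio between the most and least frequent class.
imbalance_ratio = max(counts.values()) / min(counts.values())
print(f"largest / smallest class: {imbalance_ratio:.0f}x")
```

Classes 1 and 2 together cover most of the rows, while class 4 is rare, so the classes are clearly imbalanced.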

In [22]:
print (cov_df.dtypes)

Elevation                             object
Aspect                                object
Slope                                 object
Horizontal_Distance_To_Hydrology      object
Vertical_Distance_To_Hydrology        object
Horizontal_Distance_To_Roadways       object
Hillshade_9am                         object
Hillshade_Noon                        object
Hillshade_3pm                         object
Horizontal_Distance_To_Fire_Points    object
Wilderness_Area1                      object
Wilderness_Area2                      object
Wilderness_Area3                      object
Wilderness_Area4                      object
Soil_Type1                            object
Soil_Type2                            object
Soil_Type3                            object
Soil_Type4                            object
Soil_Type5                            object
Soil_Type6                            object
Soil_Type7                            object
Soil_Type8                            object
Soil_Type9

The columns are stored as `object`. XGBoost uses `DMatrix`, which needs the data in int, float or bool format.

So we convert all the columns to numeric:

In [23]:
for col in cov_df.columns:
    cov_df[col] = pd.to_numeric(cov_df[col])
print ("Done converting cols to numeric")

Done converting cols to numeric
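The same conversion can also be written without an explicit loop. The sketch below uses a tiny hypothetical frame (three columns, three rows) rather than `cov_df`, and additionally shows the `downcast` option of `pd.to_numeric`, which picks the smallest integer type that holds the values. Whether downcasting is worthwhile here is an assumption to check, not something the notebook measures:

```python
import pandas as pd

# Hypothetical miniature of the problem: numeric values stored as strings.
toy = pd.DataFrame({
    "Elevation": pd.Series(["2596", "2590", "2804"], dtype=object),
    "Slope": pd.Series(["3", "2", "9"], dtype=object),
    "target": pd.Series(["5", "5", "2"], dtype=object),
})

# Convert every column in one pass; by default a value that cannot be
# parsed raises an error instead of being silently dropped.
converted = toy.apply(pd.to_numeric)

# Optional: choose the smallest integer dtype that fits each column.
compact = toy.apply(pd.to_numeric, downcast="integer")

print(converted.dtypes)
print(compact.dtypes)
```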


In [24]:
print (cov_df.dtypes.values)

[dtype('float64') dtype('float64') dtype('float64') dtype('float64')
 dtype('float64') dtype('float64') dtype('float64') dtype('float64')
 dtype('float64') dtype('float64') dtype('float64') dtype('float64')
 dtype('float64') dtype('float64') dtype('float64') dtype('float64')
 dtype('float64') dtype('float64') dtype('float64') dtype('float64')
 dtype('float64') dtype('float64') dtype('float64') dtype('float64')
 dtype('float64') dtype('float64') dtype('float64') dtype('float64')
 dtype('float64') dtype('float64') dtype('float64') dtype('float64')
 dtype('float64') dtype('float64') dtype('float64') dtype('float64')
 dtype('float64') dtype('float64') dtype('float64') dtype('float64')
 dtype('float64') dtype('float64') dtype('float64') dtype('float64')
 dtype('float64') dtype('float64') dtype('float64') dtype('float64')
 dtype('float64') dtype('float64') dtype('float64') dtype('float64')
 dtype('float64') dtype('float64') dtype('int64')]


XGBoost's multi-class objectives also expect the class labels to be integers from 0 to `num_class - 1`, so we shift the labels 1 to 7 down by one:

In [0]:
cov_df['target'] = cov_df['target'] - 1
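Subtracting 1 works here because the labels are the consecutive integers 1 to 7. As a more general sketch (not what the notebook does), `np.unique` with `return_inverse=True` maps arbitrary labels to 0-based codes and keeps the lookup table for decoding. The `labels` array below is a made-up sample:

```python
import numpy as np

# Hypothetical sample of raw class labels, with gaps and stored as strings.
labels = np.array(["1", "2", "7", "2", "5"])

# classes: sorted unique labels; encoded: position of each label in classes.
classes, encoded = np.unique(labels, return_inverse=True)

print(classes)
print(encoded)

# Map the 0-based codes back to the original labels, e.g. after predicting.
decoded = classes[encoded]
```

The codes always run from 0 to `len(classes) - 1`, even when the original labels have gaps.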

### Parallelizing in XGBoost

Random forests, which use bagging, are easy to parallelize: every tree in a bagging model is independent of the others.

In boosting, each tree depends on the trees built before it, so the ensemble is much harder to parallelize.

XGBoost cannot train several trees of a boosted ensemble at the same time, but it can parallelize the construction of each individual tree. How much impact does this really have?
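The difference between the two schemes can be sketched with one-split regression stumps in NumPy. This is a toy illustration under stated assumptions, not XGBoost's implementation: `fit_stump`, `predict_stump` and `fit_on_bootstrap` are hypothetical helpers, the data is synthetic, and the thread pool only demonstrates that the bagged fits are independent (it is not meant as a speed measurement).

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def fit_stump(x, y):
    """Fit a one-split regression stump; return (threshold, left_mean, right_mean)."""
    best = None
    # Each candidate threshold is scored independently of the others. This
    # inner search is the kind of work that can be spread across threads.
    for t in np.unique(x)[:-1]:
        left, right = y[x <= t], y[x > t]
        err = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or err < best[0]:
            best = (err, t, left.mean(), right.mean())
    return best[1:]


def predict_stump(stump, x):
    t, left_mean, right_mean = stump
    return np.where(x <= t, left_mean, right_mean)


# Synthetic one-feature regression data.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = np.sin(x) + rng.normal(scale=0.1, size=200)


def fit_on_bootstrap(seed):
    """Fit one stump on its own bootstrap sample of (x, y)."""
    idx = np.random.default_rng(seed).integers(0, len(x), size=len(x))
    return fit_stump(x[idx], y[idx])


# Bagging: no stump needs the result of another one, so the fits can be
# submitted to a pool of workers and run in any order.
with ThreadPoolExecutor(max_workers=4) as pool:
    bagged = list(pool.map(fit_on_bootstrap, range(10)))
bagged_pred = np.mean([predict_stump(s, x) for s in bagged], axis=0)

# Boosting: each stump is fit to the residual left by all previous stumps,
# so the rounds have to run one after another.
learning_rate = 0.5
boosted_pred = np.zeros_like(y)
for _ in range(10):
    residual = y - boosted_pred
    stump = fit_stump(x, residual)
    boosted_pred += learning_rate * predict_stump(stump, x)

print("bagging MSE :", np.mean((y - bagged_pred) ** 2))
print("boosting MSE:", np.mean((y - boosted_pred) ** 2))
```

The boosting loop cannot be turned into a `pool.map` call because round k needs the predictions of rounds 0 to k-1. What remains parallelizable is the work inside a single round, such as the threshold search in `fit_stump`.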

## Benchmarking with Hummingbird

> Based on the YouTube tutorial by AIEngineering: https://www.youtube.com/watch?v=XWD4F7LNvNE

---



In [11]:
!nvidia-smi

Sat Jun 13 16:21:59 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.36.06    Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   38C    P0    29W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                 ERR! |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

Ok, so we have a Tesla P100 GPU with 16 GB of memory, as the `nvidia-smi` output above shows