# Some tests with LightGBM

This notebook is based on: https://datascience.eu/machine-learning/1-what-is-light-gbm/

# Which algorithm takes the crown: Light GBM vs XGBOOST?

## Preparation

<ul>
    <li>Install LightGBM first with <i>pip install lightgbm</i>.</li>
    <li>Download the Adult dataset, which can be found <a href="http://archive.ics.uci.edu/ml/datasets/Adult">here</a>.<br>
        <b>Note:</b> Save the dataset as a .csv file.</li>
</ul>

## 1. What is Light GBM?

Light GBM is a fast, distributed, high-performance gradient boosting framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.

Since it is based on decision tree algorithms, it splits the tree leaf-wise, choosing the leaf with the best fit, whereas other boosting algorithms split the tree depth-wise or level-wise. When growing the same leaf, the leaf-wise algorithm can reduce more loss than the level-wise algorithm, which can lead to better accuracy than level-wise boosting typically achieves. It is also surprisingly fast, hence the word ‘Light’.

The makers of Light GBM illustrate the difference between the two growth strategies with a diagram.
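
As a rough illustration of why leaf-wise growth can reduce more loss for the same number of splits, here is a toy sketch. The per-node gains are made-up numbers, nodes are numbered heap-style (root is 1, children of n are 2n and 2n+1), and `level_wise` and `leaf_wise` are hypothetical helpers, not part of LightGBM.

```python
import heapq

# Made-up loss reduction obtained by splitting each node (assumption for illustration).
gain = {1: 10.0, 2: 6.0, 3: 1.0, 4: 5.0, 5: 0.5, 6: 0.4, 7: 0.3}


def level_wise(gain, n_splits):
    """Split nodes level by level: in heap numbering this is simply 1, 2, 3, ..."""
    return sum(gain.get(node, 0.0) for node in range(1, n_splits + 1))


def leaf_wise(gain, n_splits):
    """Always split the current leaf with the largest gain (best-first growth)."""
    total = 0.0
    heap = [(-gain.get(1, 0.0), 1)]  # max-heap via negated gains
    for _ in range(n_splits):
        neg_gain, node = heapq.heappop(heap)
        total += -neg_gain
        for child in (2 * node, 2 * node + 1):
            heapq.heappush(heap, (-gain.get(child, 0.0), child))
    return total


print(level_wise(gain, 3))  # splits nodes 1, 2, 3
print(leaf_wise(gain, 3))   # splits nodes 1, 2, 4
```

With a budget of three splits, the level-wise order spends one split on node 3 (gain 1.0), while the leaf-wise order goes deeper into the more promising branch. The same mechanism is what makes leaf-wise trees more prone to overfitting.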

## 2. Advantages of Light GBM

Faster training speed and higher efficiency: Light GBM uses a histogram-based algorithm, i.e. it buckets continuous feature values into discrete bins, which speeds up the training procedure.

Lower memory usage: Replacing continuous values with discrete bins results in lower memory usage.
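
To make the binning idea concrete, here is a minimal NumPy sketch. It uses quantile-based bin edges, which is an assumption for illustration only; LightGBM's own bin construction is more involved. The variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
values = rng.normal(size=10_000)  # one continuous feature, stored as float64

max_bin = 255  # number of bins we allow for this feature

# Interior bin edges at evenly spaced quantiles (assumption: not LightGBM's exact rule).
edges = np.quantile(values, np.linspace(0, 1, max_bin + 1)[1:-1])

# Each value is replaced by the index of its bin, which fits in a single byte.
bin_ids = np.digitize(values, edges).astype(np.uint8)

print(values.nbytes, "bytes as float64")
print(bin_ids.nbytes, "bytes as uint8 bin indices")
```

Split finding then only has to scan at most `max_bin` candidate thresholds per feature instead of every distinct value, and the binned feature takes one byte per row instead of eight.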

Better accuracy than other boosting algorithms: It produces much more complex trees by following a leaf-wise split approach rather than a level-wise approach, which is the main factor in achieving higher accuracy. However, it can sometimes lead to overfitting, which can be limited by setting the max_depth parameter.

Compatibility with large datasets: It performs equally well on large datasets, with a significant reduction in training time compared to XGBoost.

Parallel learning is supported.

Before we dive head first into building our first Light GBM model, let us look at some of the parameters of Light GBM to understand its underlying procedures.

## 3. Important Parameters of Light GBM

<ul>
    <li>task: default = train; options = train, prediction;<br>
        Specifies the task we wish to perform, which is either training or prediction.</li>
    <li>application: default = regression; type = enum; options:
        <ul>
            <li>regression: perform a regression task</li>
            <li>binary: binary classification</li>
            <li>multiclass: multiclass classification</li>
            <li>lambdarank: lambdarank application</li>
        </ul>
    </li>
    <li>data: type=string; training data , LightGBM will train from this data</li>
    <li>num_iterations: number of boosting iterations to be performed ; default=100; type=int</li>
    <li>num_leaves : number of leaves in one tree ; default = 31 ; type =int</li>
    <li>device: default = cpu; options = gpu, cpu.<br>
        Device on which we want to train our model. Choose GPU for faster training.</li>
    <li>max_depth: Specifies the maximum depth to which a tree will grow.<br>
        This parameter is used to deal with overfitting.</li>
    <li>min_data_in_leaf: Minimum number of data points in one leaf.</li>
    <li>feature_fraction: default = 1;<br>
        Specifies the fraction of features to be used in each iteration.</li>
    <li>bagging_fraction: default = 1;<br>
        Specifies the fraction of data to be used in each iteration; generally used to speed up training and avoid overfitting.</li>
    <li>min_gain_to_split: default = 0;<br>
        Minimum gain required to perform a split.</li>
    <li>max_bin: Maximum number of bins used to bucket the feature values.</li>
    <li>min_data_in_bin: Minimum number of data points in one bin.</li>
    <li>num_threads: default=OpenMP_default, type=int;<br>
        Number of threads for Light GBM.</li>
    <li>label: type=string;<br>
        specify the label column</li>
    <li>categorical_feature: type=string;<br>
        Specifies the categorical features to be used for training our model.</li>
    <li>num_class: default=1; type=int;<br>
        used just for multi-class classification</li>
</ul>
It is also worth going through an article that explains parameter tuning in XGBoost in detail.
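
As a sketch of how the parameters listed above fit together, they are typically collected in a plain Python dictionary before training. The values below are illustrative assumptions, not recommendations, and the commented-out line only indicates where such a dictionary would be passed.

```python
# Illustrative parameter dictionary for a binary classification task.
# All values are assumptions chosen for the example, not tuned settings.
params = {
    "objective": "binary",       # same role as "application" in the list above
    "metric": ["auc", "binary_logloss"],
    "num_leaves": 31,            # leaves per tree
    "max_depth": 7,              # cap on tree depth to limit overfitting
    "min_data_in_leaf": 20,      # minimum data points per leaf
    "feature_fraction": 0.8,     # fraction of features used per iteration
    "bagging_fraction": 0.8,     # fraction of rows used per bagging round
    "bagging_freq": 1,           # bagging only takes effect when this is non-zero
    "max_bin": 255,              # maximum number of histogram bins per feature
    "num_threads": 4,            # number of threads
}

# Hypothetical usage, assuming `train_data` is an lgb.Dataset:
# booster = lgb.train(params, train_data, num_boost_round=100)

for name, value in params.items():
    print(f"{name} = {value}")
```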

## 4. LightGBM vs XGBoost

Now let’s compare LightGBM with XGBoost by applying both algorithms to a dataset and then comparing their performance.

Here we use a dataset that contains information about individuals from various countries. Our target is to predict whether an individual makes more than 50K annually on the basis of the other information available. The dataset consists of 32561 observations and 14 features describing individuals.

Go through the dataset to get a proper intuition about the predictor variables, so that you can follow the code.

### Getting started

#### Import the relevant libraries

In [3]:
import numpy as np
import pandas as pd 
from pandas import Series, DataFrame 

#import lightgbm and xgboost
import lightgbm as lgb
import xgboost as xgb 

#### import the dataset & preprocess

In [5]:
data=pd.read_csv('adult.csv',header=None)

Assigning names to the columns 

In [6]:
data.columns=['age','workclass','fnlwgt','education','education-num','marital_Status','occupation','relationship','race','sex','capital_gain','capital_loss','hours_per_week','native_country','Income'] 

glimpse of the dataset 

In [7]:
data.head() 

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital_Status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,Income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


#### Label Encoding our target variable 

In [10]:
from sklearn.preprocessing import LabelEncoder,OneHotEncoder
l=LabelEncoder() 
l.fit(data.Income) 
l.classes_ 

array([0, 1])

label encoding our target variable 

In [11]:
data.Income=Series(l.transform(data.Income))  
data.Income.value_counts() 

0    24720
1     7841
Name: Income, dtype: int64

One Hot Encoding of the Categorical features 

In [12]:
one_hot_workclass=pd.get_dummies(data.workclass)
one_hot_education=pd.get_dummies(data.education)
one_hot_marital_Status=pd.get_dummies(data.marital_Status)
one_hot_occupation=pd.get_dummies(data.occupation)
one_hot_relationship=pd.get_dummies(data.relationship)
one_hot_race=pd.get_dummies(data.race)
one_hot_sex=pd.get_dummies(data.sex)
one_hot_native_country=pd.get_dummies(data.native_country) 

removing categorical features 

In [15]:
data.drop(['workclass','education','marital_Status','occupation','relationship','race','sex','native_country'],axis=1,inplace=True) 

Merging one hot encoded features with our dataset ‘data’ 

In [16]:
data=pd.concat([data,one_hot_workclass,one_hot_education,one_hot_marital_Status,one_hot_occupation,one_hot_relationship,one_hot_race,one_hot_sex,one_hot_native_country],axis=1) 
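
As an alternative sketch of the encoding steps above: calling `pd.get_dummies` separately for each column produces identically named columns whenever two features share a category (for example `?`), which is why a de-duplication step becomes necessary. Passing the column list to a single `pd.get_dummies` call prefixes each dummy with its source column name. The toy frame below is made up for illustration.

```python
import pandas as pd

# Tiny made-up frame where two categorical columns share the category '?'.
toy = pd.DataFrame({
    "age": [39, 50, 38],
    "workclass": ["Private", "?", "Private"],
    "occupation": ["Sales", "?", "Sales"],
})

# Column-by-column encoding followed by concat: the two '?' columns collide.
naive = pd.concat(
    [pd.get_dummies(toy.workclass), pd.get_dummies(toy.occupation)], axis=1
)
print(list(naive.columns))

# One call with `columns=`: every dummy is prefixed, so names stay unique.
encoded = pd.get_dummies(toy, columns=["workclass", "occupation"])
print(list(encoded.columns))
```

With prefixed names no information is lost to de-duplication, since `workclass_?` and `occupation_?` remain separate features.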

Removing duplicate columns 

In [19]:
_, i = np.unique(data.columns, return_index=True) 

data=data.iloc[:, i]
data

Unnamed: 0,10th,11th,12th,1st-4th,5th-6th,7th-8th,9th,?,Adm-clerical,Amer-Indian-Eskimo,...,Wife,Without-pay,Yugoslavia,Income,age,capital_gain,capital_loss,education-num,fnlwgt,hours_per_week
0,0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,39,2174,0,13,77516,40
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,50,0,0,13,83311,13
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,38,0,0,9,215646,40
3,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,53,0,0,7,234721,40
4,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,28,0,0,13,338409,40
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32556,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,27,0,0,12,257302,38
32557,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,40,0,0,9,154374,40
32558,0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,58,0,0,9,151910,40
32559,0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,22,0,0,9,201490,20


Here our target variable is ‘Income’ with values as 1 or 0.  


Separating our data into features dataset x and our target dataset y 

In [21]:
x=data.drop('Income',axis=1) 
y=data.Income 

Imputing missing values in our target variable 

In [25]:
y.fillna(y.mode()[0],inplace=True) 

### Test & train model

Now splitting our dataset into test and train 

In [26]:
from sklearn.model_selection import train_test_split 
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=.3)

Applying XGBoost. The data is stored in a DMatrix object, and label is used to define our outcome variable.

In [27]:
dtrain=xgb.DMatrix(x_train,label=y_train)
dtest=xgb.DMatrix(x_test)

setting parameters for xgboost

In [37]:
parameters={'max_depth':7, 'eta':1, 'verbosity':1,'objective':'binary:logistic','eval_metric':'auc','learning_rate':.05}

<b>Please note:</b> the original statement was:<br>
<i>parameters={'max_depth':7, 'eta':1, 'silent':1,'objective':'binary:logistic','eval_metric':'auc','learning_rate':.05}</i>. <br><br>
However, I received the following warning:<br><br>
<i>WARNING: C:\Users\Administrator\workspace\xgboost-win64_release_1.2.0\src\learner.cc:516: 
Parameters: { silent } might not be used.
This may not be accurate due to some parameters are only used in language bindings but passed down to XGBoost core.  Or some parameters are not used but slip through this verification. Please open an issue if you find above cases.</i><br><br>
This is because newer XGBoost releases deprecate the silent parameter in favour of verbosity. Changing the statement to:<br><br>
<i>parameters={'max_depth':7, 'eta':1, 'verbosity':0,'objective':'binary:logistic','eval_metric':'auc','learning_rate':.05}</i><br><br> solved the issue.

training our model 

In [39]:
num_round=50
from datetime import datetime 
start = datetime.now() 
xg=xgb.train(parameters,dtrain,num_round) 
stop = datetime.now()

Execution time of the model 

In [40]:
execution_time_xgb = stop-start 
execution_time_xgb

datetime.timedelta(seconds=2, microseconds=869828)

The datetime.timedelta( , , ) representation is (days, seconds, microseconds).

Now predicting with our model on the test set 

In [41]:
ypred=xg.predict(dtest) 
ypred

array([0.38518557, 0.04324007, 0.35892898, ..., 0.04668105, 0.05638991,
       0.20022953], dtype=float32)

Converting probabilities into 1 or 0  

In [43]:
# setting threshold to .5; ypred is overwritten with hard 0/1 labels
ypred = np.where(ypred >= .5, 1, 0)

calculating accuracy of our model 

In [44]:
from sklearn.metrics import accuracy_score 
accuracy_xgb = accuracy_score(y_test,ypred) 
accuracy_xgb

0.86651653188658

### Light GBM

In [45]:
train_data=lgb.Dataset(x_train,label=y_train)

setting parameters for lightgbm

In [49]:
param = {'num_leaves':150, 'objective':'binary','max_depth':7,'learning_rate':.05,'max_bin':200}
param['metric'] = ['auc', 'binary_logloss']

Here we have set max_depth in xgb and LightGBM to 7 to have a fair comparison between the two.

training our model using light gbm

In [50]:
num_round=50
start=datetime.now()
lgbm=lgb.train(param,train_data,num_round)
stop=datetime.now()

[LightGBM] [Info] Number of positive: 5501, number of negative: 17291
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 700
[LightGBM] [Info] Number of data points in the train set: 22792, number of used features: 89
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.241357 -> initscore=-1.145256
[LightGBM] [Info] Start training from score -1.145256


Execution time of the model

In [51]:
execution_time_lgbm = stop-start
execution_time_lgbm

datetime.timedelta(microseconds=344127)

predicting on test set & showing first 5 predictions

In [52]:
ypred2=lgbm.predict(x_test)
ypred2[0:5]

array([0.37287314, 0.02107162, 0.32087419, 0.13745559, 0.93993152])

converting probabilities into 0 or 1

In [53]:
# setting threshold to .5; ypred2 is overwritten with hard 0/1 labels
ypred2 = np.where(ypred2 >= .5, 1, 0)

calculating accuracy

In [54]:
from sklearn.metrics import roc_auc_score
accuracy_lgbm = accuracy_score(y_test,ypred2)
y_test.value_counts()
accuracy_lgbm

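Calculating roc_auc_score for xgboost. Note that ypred was already converted to 0/1 labels above, so this score is computed from hard labels rather than predicted probabilities.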

In [55]:
auc_xgb =  roc_auc_score(y_test,ypred)
auc_xgb

0.7723047700568229
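
Because the predictions were overwritten with 0/1 labels before this step, the AUC here is computed from hard labels. The sketch below, using made-up numbers and a hypothetical helper `pairwise_auc` (a plain pairwise definition of AUC, not the scikit-learn implementation), shows how much ranking information thresholding can discard.

```python
def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Made-up labels and predicted probabilities (assumption for illustration).
y_true = [0, 0, 1, 1, 0, 1]
probs = [0.1, 0.2, 0.3, 0.8, 0.25, 0.45]

# Thresholding at 0.5 collapses most scores to 0 and creates many ties.
labels = [1 if p >= 0.5 else 0 for p in probs]

auc_from_probs = pairwise_auc(y_true, probs)
auc_from_labels = pairwise_auc(y_true, labels)
print(auc_from_probs, auc_from_labels)
```

In this toy case every positive is ranked above every negative, so the probabilities give a perfect AUC, while the thresholded labels do not. Keeping the raw probabilities in a separate variable and passing those to `roc_auc_score` avoids the problem.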

calculating roc_auc_score for light gbm. 

In [78]:
auc_lgbm = roc_auc_score(y_test,ypred2)
comparison_dict = {'accuracy score':(accuracy_lgbm,accuracy_xgb),'auc score':(auc_lgbm,auc_xgb),'execution time':(execution_time_lgbm,execution_time_xgb)}

In [79]:
comparison_dict

{'accuracy score': (0.8647763332992118, 0.86651653188658),
 'auc score': (0.7657448633387521, 0.7723047700568229),
 'execution time': (datetime.timedelta(microseconds=344127),
  datetime.timedelta(seconds=2, microseconds=869828))}

Creating a dataframe ‘comparison_df’ for comparing the performance of Lightgbm and xgb. 

In [84]:
comparison_df = DataFrame(comparison_dict) 
comparison_df.index= ['LightGBM','xgboost'] 

Performance comparison

In [83]:
comparison_df

Unnamed: 0,accuracy score,auc score,execution time
LightGBM,0.864776,0.765745,0 days 00:00:00.344127
xgboost,0.866517,0.772305,0 days 00:00:02.869828


In this run, Light GBM scored marginally lower than XGBoost on both accuracy (0.8648 vs 0.8665) and AUC score (0.7657 vs 0.7723), but there is a big difference in the execution time of the training procedure: about 0.34 seconds against 2.87 seconds. Light GBM is roughly 8 times faster than XGBoost here, which makes it a much better approach when handling large datasets.
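
The speed-up can be read directly off the two timedelta values recorded in the comparison above. A small sketch, with the two durations copied from that output:

```python
from datetime import timedelta

# Training times copied from the comparison output above.
time_lgbm = timedelta(microseconds=344127)
time_xgb = timedelta(seconds=2, microseconds=869828)

# Dividing one timedelta by another yields a plain float ratio.
speedup = time_xgb / time_lgbm

print(f"LightGBM: {time_lgbm.total_seconds():.3f} s")
print(f"XGBoost:  {time_xgb.total_seconds():.3f} s")
print(f"Speed-up: {speedup:.1f}x")
```

A single timed run like this is noisy, so the ratio should be treated as indicative rather than exact.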

This is a big advantage when you are working on large datasets in time-limited competitions.

## 5. Tuning Parameters of Light GBM

Light GBM uses leaf-wise splitting rather than depth-wise splitting, which enables it to converge much faster but can also lead to overfitting. So here is a quick guide to tuning the parameters of Light GBM.

For best fit
<ul>
    <li>num_leaves:<br>
        This parameter sets the number of leaves to be formed in a tree. Theoretically, the relation between num_leaves and max_depth is num_leaves = 2^(max_depth). However, this is not a good estimate for Light GBM, since splitting takes place leaf-wise rather than depth-wise. Hence num_leaves should be set smaller than 2^(max_depth), otherwise it may lead to overfitting. Light GBM has no direct relation between num_leaves and max_depth, so the two must not be tied to each other.</li>
    <li>min_data_in_leaf:<br>
        This is also one of the important parameters for dealing with overfitting. Setting its value too small may cause overfitting, so it must be set accordingly. Its value should be in the hundreds to thousands for large datasets.</li>
    <li>max_depth:<br>
        It specifies the maximum depth or level up to which a tree can grow.</li>
</ul>
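
The bound between num_leaves and max_depth can be checked with a few lines of Python. This is a minimal sketch; `max_leaves` and `exceeds_depth_capacity` are hypothetical helper names, and the example values are the ones used in the comparison earlier in this notebook.

```python
def max_leaves(max_depth: int) -> int:
    """A binary tree of depth `max_depth` can hold at most 2**max_depth leaves."""
    return 2 ** max_depth


def exceeds_depth_capacity(num_leaves: int, max_depth: int) -> bool:
    """True when num_leaves is larger than a tree of that depth can ever contain."""
    return num_leaves > max_leaves(max_depth)


# Settings used in the LightGBM run above: num_leaves=150, max_depth=7.
print(max_leaves(7))
print(exceeds_depth_capacity(150, 7))

# A smaller setting that stays below the bound.
print(exceeds_depth_capacity(31, 7))
```

With max_depth set to 7 a tree cannot have more than 128 leaves, so a num_leaves of 150 is never reached and max_depth is the constraint that actually applies.
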
For faster speed

<ul>
    <li>bagging_fraction: used to perform bagging for faster results</li>
    <li>feature_fraction: sets the fraction of the features to be used at each iteration</li>
    <li>max_bin: a smaller value of max_bin can save a lot of time, because bucketing the feature values into fewer discrete bins is computationally inexpensive.</li>
</ul>
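
A sketch of how the speed-oriented settings above might be layered on top of an existing parameter dictionary. The concrete values are assumptions for illustration; note that bagging_fraction only has an effect when bagging_freq is set to a non-zero value.

```python
# Baseline parameters (illustrative values, not tuned).
base_params = {
    "objective": "binary",
    "num_leaves": 31,
    "learning_rate": 0.05,
}

# Overrides aimed at faster training; values are assumptions for the example.
speed_overrides = {
    "bagging_fraction": 0.8,  # use 80% of the rows per bagging round
    "bagging_freq": 1,        # re-sample every iteration; 0 would disable bagging
    "feature_fraction": 0.8,  # use 80% of the features per iteration
    "max_bin": 63,            # fewer histogram bins per feature
}

# Later keys win, so the overrides replace any matching baseline entries.
fast_params = {**base_params, **speed_overrides}
print(fast_params)
```

Each of these trades some model capacity for speed, so accuracy should be re-checked after applying them.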

For better accuracy

<ul>
    <li>Use bigger training data.</li>
    <li>num_leaves: Setting it to a high value produces deeper trees with increased accuracy but can lead to overfitting. Hence a very high value is not preferred.</li>
    <li>max_bin: Setting it to a high value has an effect similar to increasing num_leaves, and also slows down the training procedure.</li>
</ul>